Slurm
The queueing system used at SciNet is based around the Slurm Workload Manager. This "scheduler", Slurm, determines which jobs will be run on which compute nodes, and when. This page outlines how to submit jobs, how to interact with the scheduler, and some of the most common Slurm commands.
Some common questions about the queuing system can be found on the FAQ as well.
Submitting jobs
You submit jobs from a Niagara login node. This is done by passing a script to the sbatch command:
nia-login07:~$ sbatch jobscript.sh
This puts the job, described by the job script, into the queue. The scheduler will run the job on the compute nodes in due course. A typical submission script is as follows.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"
Some notes about this example:
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name mpi_job).
- In this case, SLURM looks for 2 nodes with 40 cores each on which to run 80 tasks, for 1 hour.
- Note that the mpirun flag "--ppn" (processors per node) is ignored. Slurm takes care of this detail.
- Once the scheduler finds a spot to run the job, it runs the script:
- It changes to the submission directory;
- Loads modules;
- Runs the mpi_example application.
- To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
To create a job script appropriate for your work, you must modify the commands above to instruct Slurm to run the commands you need to run.
Things to remember
There are some things to always bear in mind when crafting your submission script:
- Scheduling is by node, so in multiples of 40 cores. You are expected to use all 40 cores! If you are running serial jobs, and need assistance bundling your work into multiples of 40, please see the serial jobs page.
- Jobs must write to your scratch or project directory (home is read-only on compute nodes).
- Compute nodes have no internet access. Download data you need before submitting your job.
- Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below).
- Jobs will run under your group's RRG allocation. If your group does not have an allocation, your job will run under your group's RAS allocation (previously called the 'default' allocation). Note that groups with an allocation cannot run under a default allocation.
- The maximum walltime for all users is 24 hours. The minimum and default walltime is 15 minutes.
Scheduling details
We now present the details of how to write a job script, and some extra commands which you might find useful.
SLURM nomenclature: jobs, nodes, tasks, cpus, cores, threads
SLURM has a somewhat different way of referring to things like MPI processes and thread tasks, as compared to our previous scheduler, MOAB. The SLURM nomenclature is reflected in the names of scheduler options (i.e., resource requests). SLURM strictly enforces those requests, so it is important to get this right.
term | meaning | SLURM term | related scheduler options |
---|---|---|---|
job | scheduled piece of work for which specific resources were requested. | job | sbatch, salloc |
node | basic computing component with several cores (40 for Niagara) that share memory | node | --nodes -N |
mpi process | one of a group of running programs using Message Passing Interface for parallel computing | task | --ntasks -n --ntasks-per-node |
core or physical cpu | A fully functional independent physical execution unit. | - | - |
logical cpu | An execution unit that the operating system can assign work to. Operating systems can be configured to overload physical cores with multiple logical cpus using hyperthreading. | cpu | --cpus-per-task |
thread | one of possibly multiple simultaneous execution paths within a program, which can share memory. | - | --cpus-per-task and OMP_NUM_THREADS |
hyperthread | a thread run in a collection of threads that is larger than the number of physical cores. | - | - |
Scheduling by Node
- On many systems that use SLURM, the scheduler will deduce from the job script specifications (the number of tasks and the number of cpus-per-node) what resources should be allocated. On Niagara, this is a bit different.
- All job resource requests on Niagara are scheduled as a multiple of nodes.
- The nodes that your jobs run on are exclusively yours.
- No other users are running anything on them.
- You can ssh into them, while your job is running, to see how things are going.
- Whatever you request of the scheduler, your request will always be translated into a multiple of nodes allocated to your job.
- Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes. Each node has about 202GB of RAM available.
- You should try to use all the cores on the nodes allocated to your job. Since there are 40 cores per node, your job should use N x 40 cores. If this is not the case, we will contact you to help you optimize your workflow. Again, users who have serial jobs should consult the serial jobs page.
Hyperthreading: Logical CPUs vs. cores
Hyperthreading, a technology that leverages more of the physical hardware by pretending there are twice as many logical cores as real cores, is enabled on Niagara. The operating system and scheduler see 80 logical CPUs.
Using 80 logical CPUs versus 40 real cores typically gives about a 5-10% speedup, depending on your application (your mileage may vary).
Because Niagara is scheduled by node, hyperthreading is actually fairly easy to use:
- Ask for a certain number of nodes, N, for your job.
- You know that you get 40 x N cores, so you will use (at least) a total of 40 x N MPI processes or threads (mpirun, srun, and the OS will automatically spread these over the real cores).
- But you should also test if running 80 x N MPI processes or threads gives you any speedup.
- Regardless, your usage will be counted as 40 x N x (walltime in years).
Many applications which are communication-heavy can benefit from the use of hyperthreading.
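The accounting rule in the list above (40 x N x walltime in years) can be made concrete with a small calculation. The following is only a sketch, not a SciNet tool: the function name core_years is hypothetical, and it assumes a 365-day year (8760 hours) for the conversion.

```shell
#!/bin/bash
# Hypothetical helper: estimate the core-years counted for a job.
# Usage is counted on the 40 physical cores per node, whether the job
# runs 40 or 80 processes/threads per node.
# Assumption: one year is taken as 365 days = 8760 hours.
core_years() {
    local nodes="$1" hours="$2"
    awk -v n="$nodes" -v h="$hours" 'BEGIN { printf "%.4f\n", 40 * n * h / 8760 }'
}

# Example: 2 nodes for 24 hours
core_years 2 24
```

Calling core_years with a node count and a walltime in hours gives a quick way to compare the cost of different job sizes before submitting.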
Submission script details
This section outlines some details of how to interact with the scheduler, and how it implements Niagara's scheduling policies.
Queues
There are 3 queues available on SciNet systems. These queues have different limits; see the Limits section for further details.
Compute
The compute queue is the default queue. Most jobs will run in this queue. If no flags are specified in the submission script this is the queue where your job will land.
Debug
The Debug queue is a high-priority queue, used for short-term testing of your code. Do NOT use the debug queue for production work. You can use the debug queue in one of two ways. To submit a standard job script to the debug queue, add the line
#SBATCH -p debug
to your submission script. This will put the job into the debug queue, and it should run in short order.
To request an interactive debug session, where you retain control over the command line prompt, at a login node type the command
nia-login07:~$ salloc -p debug --nodes 1 --time=1:00:00
This will request 1 node for 1 hour. You can similarly request a debug session using the 'debugjob' command:
nia-login07:~$ debugjob N
where N is the number of nodes. If N=1, this gives an interactive session of 1 hour; when N=4 (the maximum), it gives you 30 minutes.
Archive
The archivelong and archiveshort queues are only used by the HPSS system. See that page for details on how to use these queues.
Limits
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a Resources for Research Groups (RRG) allocation or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.
Usage | Partition | Running jobs | Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime |
---|---|---|---|---|---|---|---|
Compute jobs | compute | 50 | 1000 | 1 node (40 cores) | default: 20 nodes (800 cores); with allocation: 1000 nodes (40000 cores) | 15 minutes | 24 hours |
Testing or troubleshooting | debug | 1 | 1 | 1 node (40 cores) | 4 nodes (160 cores) | N/A | 1 hour |
Archiving or retrieving data in HPSS | archivelong | 2 per user (max 5 total) | 10 per user | N/A | N/A | 15 minutes | 72 hours |
Inspecting archived data, small archival actions in HPSS | archiveshort | 2 per user | 10 per user | N/A | N/A | 15 minutes | 1 hour |
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.
Slurm Accounts
To be able to prioritise jobs based on groups and allocations, the Slurm scheduler uses the concept of accounts. Each group that has a Resources for Research Groups (RRG) or Research Platforms and Portals (RPP) allocation (awarded through an annual competition by Compute Canada) has an account that starts with rrg- or rpp-. Slurm assigns a 'fairshare' priority to these accounts based on the size of the award in core-years. Groups without an RRG or RPP can use Niagara using a so-called Rapid Access Service (RAS), and have an account that starts with def-.
On Niagara, most users will only ever use one account, and those users do not need to specify the account to Slurm. However, users that are part of collaborations may be able to use multiple accounts, i.e., that of their sponsor and that of their collaborator, which means that they need to select the right account when running jobs.
To select the account, just add
#SBATCH -A [account]
to the job script, or pass -A [account] to salloc or debugjob.
To see which accounts you have access to, or what their names are, use the command
sshare -U
It has been noted that, in some cases, using the '-A' flag does not result in the appropriate account being used. To get around this, specify the account when sbatch is invoked:
sbatch -A account myjobscript.sh
Slurm environment variables
There are many environment variables built into Slurm. These are some which you may find useful:
- SLURM_SUBMIT_DIR: directory from which the job was submitted.
- SLURM_SUBMIT_HOST: host from which the job was submitted.
- SLURM_JOB_ID: the job's id.
- SLURM_JOB_NUM_NODES: number of nodes in the job.
- SLURM_JOB_NODELIST: list of nodes assigned to the job.
- SLURM_JOB_ACCOUNT: account associated with the job.
Any of these environment variables can be accessed from within your job script.
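As an illustration, a job script could log these variables when it starts, which can help when tracing a problem later. This is a minimal sketch: the function name log_job_info is hypothetical, and the "unset" fallbacks are only there so that the snippet also runs outside of a Slurm job.

```shell
#!/bin/bash
# Hypothetical helper: print the Slurm environment of the current job.
# The ":-" fallbacks let this run outside of a job (e.g. for testing).
log_job_info() {
    echo "Job ID:      ${SLURM_JOB_ID:-unset}"
    echo "Account:     ${SLURM_JOB_ACCOUNT:-unset}"
    echo "Submit host: ${SLURM_SUBMIT_HOST:-unset}"
    echo "Submit dir:  ${SLURM_SUBMIT_DIR:-unset}"
    echo "Nodes (${SLURM_JOB_NUM_NODES:-0}): ${SLURM_JOB_NODELIST:-unset}"
}

log_job_info
```

Placed near the top of a job script, the output ends up in the job's output file alongside the program output.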
Passing Variables to submission scripts
It is possible to pass values through environment variables into your SLURM submission scripts. For doing so with already defined variables in your shell, just add the following directive in the submission script,
#SBATCH --export=ALL
and you will have access to any predefined environment variable.
A better way is to specify explicitly which variables you want to pass into the submission script,
sbatch --export=i=15,j='test' jobscript.sbatch
You can even set the job name and output files using environment variables, e.g.
i="simulation"
j=14
sbatch --job-name=$i.$j.run --output=$i.$j.out --export=i=$i,j=$j jobscript.sbatch
(The latter only works on the command line; you cannot use environment variables in #SBATCH lines in the job script.)
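Inside the job script, the passed values are then ordinary shell variables. The fragment below is a sketch of how a script might pick them up; the default values and the output file name are illustrative assumptions, not something Slurm provides.

```shell
#!/bin/bash
# Sketch of the start of a job script that expects i and j from
#   sbatch --export=i=15,j='test' jobscript.sbatch
# The defaults below are assumptions, used only when a variable is missing.
i="${i:-0}"
j="${j:-default}"

# Build a (hypothetical) output file name from the passed values.
outfile="result_${i}_${j}.txt"
echo "i=${i} j=${j} -> writing to ${outfile}"
```

Giving each variable a default keeps the script usable when it is submitted without --export.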
Command line arguments
Command line arguments can also be passed to a job script, in the same way as command line arguments for shell scripts. All command line arguments given to sbatch that follow the job script name will be passed to the job script. In fact, SLURM will not look at any of these arguments, so you must place all sbatch arguments before the script name, e.g.:
sbatch -p debug jobscript.sbatch FirstArgument SecondArgument ...
In this example, -p debug is interpreted by SLURM, while in your submission script you can access FirstArgument, SecondArgument, etc., by referring to $1, $2, ...
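Since a missing argument would otherwise only show up once the job runs, it can help to check the arguments at the top of the job script. A minimal sketch follows; the function name read_job_args and the variable names input_file and num_steps are hypothetical.

```shell
#!/bin/bash
# Hypothetical job-script helper: check the positional arguments that
# sbatch forwards to the script (everything after the script name).
read_job_args() {
    if [ "$#" -lt 2 ]; then
        echo "expected 2 arguments, got $#" >&2
        return 1
    fi
    input_file="$1"    # FirstArgument
    num_steps="$2"     # SecondArgument
}

# In a job script this would be called as:  read_job_args "$@" || exit 1
```

Exiting early with a non-zero code makes the job fail quickly instead of running with empty values.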
Job arrays
Sometimes you need to run the same job script many times, but just tweaking one value each time. One way of accomplishing this is using job arrays. Job arrays are invoked using the "-a" flag with sbatch:
sbatch -a 1-100 myjobscript.sh
This will submit 100 instances of myjobscript.sh. Within the job script you can distinguish which of those instances is running using the environment variable SLURM_ARRAY_TASK_ID.
Note that Niagara currently has a limit of 1000 submitted jobs for users within groups with allocations, and 200 submitted jobs without an allocation.
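A common pattern is to map SLURM_ARRAY_TASK_ID onto a list of parameter values. The sketch below assumes a hypothetical list of four values and a hypothetical helper param_for_task; it falls back to task 1 when run outside of a job array.

```shell
#!/bin/bash
# Sketch: pick one parameter per array task. The parameter list and the
# function name are illustrative assumptions, not part of Slurm.
# Submitted with, e.g.:  sbatch -a 1-4 myjobscript.sh
params="0.1 0.2 0.5 1.0"

param_for_task() {
    # Field N of the space-separated list corresponds to array task N.
    echo "$params" | cut -d' ' -f"$1"
}

# Fall back to task 1 when run outside of a job array.
value="$(param_for_task "${SLURM_ARRAY_TASK_ID:-1}")"
echo "Task ${SLURM_ARRAY_TASK_ID:-1} uses parameter ${value}"
```

Keep in mind that each array task is still a full job, so on Niagara it is scheduled by node and should make use of all the cores it is given.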
Job dependencies
You can make one job dependent on the successful completion of another job using the following command:
sbatch --dependency=afterok:JOBID myjobscript.sh
This will make the current job submission not start until the parent job, with jobid JOBID, successfully completes. There are many job dependency options available. Visit the Slurm sbatch page for the full list.
If the parent job fails (that is, ends with a non-zero exit code) the dependent job can never be scheduled and will be automatically cancelled.
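To chain jobs without copying the job ID by hand, the ID of the first submission can be captured with sbatch's --parsable option. The following is a sketch under assumptions: the function name submit_chain and the script names in the usage line are placeholders.

```shell
#!/bin/bash
# Sketch: submit two jobs so that the second starts only if the first succeeds.
# "sbatch --parsable" prints the job ID (followed by ";cluster" on
# multi-cluster setups), which is stripped below.
submit_chain() {
    local first_script="$1" second_script="$2"
    local first_id second_id
    first_id="$(sbatch --parsable "$first_script")" || return 1
    first_id="${first_id%%;*}"
    second_id="$(sbatch --parsable --dependency=afterok:"$first_id" "$second_script")" || return 1
    echo "${first_id} ${second_id%%;*}"
}

# Usage on a login node:  submit_chain preprocess.sh simulate.sh
```

The function prints both job IDs, so they can be passed on to squeue or scancel afterwards.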
Email Notification
Email notification works, but you need to add the email address and the type of notification you want to receive to your submission script, e.g.
#SBATCH --mail-user=YOUR.email.ADDRESS
#SBATCH --mail-type=ALL
The sbatch man page (type man sbatch on Niagara) explains all possible mail-types.
Job Location Constraints
Node types
With the expansion of Niagara there are now two node types, 1548 Intel 6148 "skylake" CPU based nodes, and 468 Intel 6248 "cascadelake" CPU based nodes. By default a job will be placed on the first available nodes but will not span node types. You can specify a node type using one of the following directives to your submission script.
#SBATCH --constraint=skylake
#SBATCH --constraint=cascade
EDR/HDR Infiniband Topology
The Infiniband high speed network used for job communication and file I/O on Niagara consists of 5 1:1 subscribed "wings" that are connected together in a dragonfly topology with adaptive routing enabled. 4 wings (dragonfly[1-4]) consist of EDR based skylake nodes and dragonfly5 contains all of the HDR100 cascadelake nodes. By default multi-node jobs will run on the first available nodes, which could be all within 1 wing or span multiple wings, but not across node types. For most scalable parallel programs the performance difference should not be very significant; however, if you wish to keep your jobs from spanning wings you can use the following.
#SBATCH --constraint=[dragonfly1|dragonfly2|dragonfly3|dragonfly4|dragonfly5]
Monitoring jobs
There are many options available for monitoring your jobs. The most basic of these is the squeue command:
nia-login07:~$ squeue -u USERNAME
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 292047   compute   myjob4 username PD       0:00      4 (Priority)
 292048   compute   myjob3 username PD       0:00      4 (Priority)
 266829   compute   myjob2 username  R   18:56:17      2 nia[1397-1398]
 266828   compute   myjob1 username  R   18:56:46      1 nia1298
Here you can see that we have two running jobs ('R') and two pending jobs ('PD'). The nodes being used are listed.
Job status
To get an estimate of when a job will start, use the command
squeue --start -j JOBID
Note that this is only an estimate, and tends not to be very accurate.
Information about a specific job can be found using
squeue -j JOBID
or alternatively
scontrol show job JOBID
which is more verbose.
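For use in scripts, the squeue output can be reduced to just the job state. The helpers below are a sketch; the function names job_state and wait_until_started are hypothetical, and the 60-second polling interval is an arbitrary choice meant to avoid querying the scheduler too often.

```shell
#!/bin/bash
# Hypothetical helper: print only the state of a job (e.g. PENDING, RUNNING).
# "-h" suppresses the header and "-o %T" selects the job state column.
job_state() {
    local jobid="$1"
    squeue -h -j "$jobid" -o "%T"
}

# Hypothetical helper: wait until the job leaves the PENDING state.
wait_until_started() {
    local jobid="$1" interval="${2:-60}"
    while [ "$(job_state "$jobid")" = "PENDING" ]; do
        sleep "$interval"
    done
}
```

On a login node, wait_until_started JOBID returns once the job is no longer pending, which includes the case where it has already finished or been cancelled.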
SSHing to a node
Once your job has started, the node belongs to you. As such you may, from a login node, SSH into the node to check the performance of your job. The first step is to find out which nodes are being used (see above). Once you have your list of nodes, you can SSH into them directly. Once there, you can run the 'top' or 'free' commands to check both CPU and memory usage.
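The steps above can be combined into a small helper that looks up the nodes of a job and runs a command on each of them. This is only a sketch: the function name on_job_nodes is hypothetical, and the job ID in the usage line is taken from the squeue example above.

```shell
#!/bin/bash
# Hypothetical helper: run a command on every node of a running job.
# "squeue -h -o %N" prints the compact node list (e.g. nia[1397-1398]) and
# "scontrol show hostnames" expands it to one host name per line.
on_job_nodes() {
    local jobid="$1"; shift
    local nodelist node
    nodelist="$(squeue -h -j "$jobid" -o "%N")"
    [ -n "$nodelist" ] || { echo "job $jobid has no nodes (not running?)" >&2; return 1; }
    for node in $(scontrol show hostnames "$nodelist"); do
        echo "== $node =="
        ssh "$node" "$@" </dev/null
    done
}

# Usage on a login node:  on_job_nodes 266829 free -g
```

This gives a one-shot snapshot per node; for continuous monitoring with top, logging into a single node interactively is simpler.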
jobperf
The jobperf script will give you feedback on the performance of your currently-running job:
nia-login07:~$ jobperf 123456
----------------------------------------------------------------------------------------------------
              RUNNING          IDLE       USER          MEMORY(MB)       PROCESS NAMES
HOSTNAME    #   %CPU  %MEM  DISK SLEEP    NAME      RAMDISK   USED  AVAIL (excl:bash,sh,ssh,sshd)
----------------------------------------------------------------------------------------------------
nia1013    71  6999%  0.5%     0    22    ejspence        0  15060 178017 14*gmx_mpi mpiexec slurm_script
nia1014    79  7677%  0.1%     0    18    ejspence        0  14803 178274 13*gmx_mpi
nia1295    79  7517%  0.4%     0    18    ejspence        0  15199 177878 13*gmx_mpi
----------------------------------------------------------------------------------------------------
Here you can see both the CPU and memory usage of the job, for all nodes being used.
Other commands
Some other commands that can be useful for dealing with your jobs:
- scancel -i JOBID cancels a specific job.
- sacct gives information about your recent jobs.
- sinfo -p compute gives a list of available nodes.
- qsum gives a summary of the queue by user.
Example submission scripts
Here we present some examples of how to create submission scripts for running parallel jobs. Serial job examples can be found on the serial jobs page.
Example submission script (MPI)
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"
Submit this script with the command:
nia-login07:~$ sbatch mpi_job.sh
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name mpi_job).
- In this case, SLURM looks for 8 nodes with 40 cores each on which to run 320 tasks, for 1 hour.
- Note that the mpirun flag "--ppn" (processors per node) is ignored.
- Once it finds such nodes, it runs the script:
- Changes to the submission directory;
- Loads modules;
- Runs the mpi_example application.
- To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
Example submission script (OpenMP)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=1:00:00
#SBATCH --job-name openmp_job
#SBATCH --output=openmp_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load intel/2018.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example"
Submit this script with the command:
nia-login07:~$ sbatch openmp_job.sh
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name openmp_job).
- In this case, SLURM looks for one node with 40 cores to be run inside one task, for 1 hour.
- Once it finds such a node, it runs the script:
- Changes to the submission directory;
- Loads modules;
- Sets an environment variable;
- Runs the openmp_example application.
- To use hyperthreading, just change --cpus-per-task=40 to --cpus-per-task=80.