Slurm
The queueing system used at SciNet is based around the Slurm Workload Manager. This "scheduler", Slurm, determines which jobs will be run on which compute nodes, and when. This page outlines how to submit jobs, how to interact with the scheduler, and some of the most common Slurm commands.
Some common questions about the queuing system can be found on the FAQ as well.
Submitting jobs
You submit jobs from a Niagara login node. This is done by passing a script to the sbatch command:
nia-login07:~$ sbatch jobscript.sh
This puts the job, described by the job script, into the queue. The scheduler will run the job on the compute nodes in due course. A typical submission script is as follows.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"
Some notes about this example:
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name mpi_job).
- In this case, SLURM looks for 2 nodes with 40 cores each on which to run 80 tasks, for 1 hour.
- Note that the mpirun flag "--ppn" (processors per node) is ignored. Slurm takes care of this detail.
- Once the scheduler finds a spot to run the job, it runs the script:
  - It changes to the submission directory;
  - Loads modules;
  - Runs the mpi_example application.
- To use hyperthreading, just change --ntasks=80 to --ntasks=160, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
To create a job script appropriate for your work, you must modify the commands above to instruct Slurm to run the commands you need run.
Things to remember
There are some things to always bear in mind when crafting your submission script:
- Scheduling is by node, so in multiples of 40 cores. You are expected to use all 40 cores! If you are running serial jobs, and need assistance bundling your work into multiples of 40, please see the serial jobs page.
- Jobs must write to your scratch or project directory (home is read-only on compute nodes).
- Compute nodes have no internet access. Download data you need before submitting your job.
- Jobs will run under your group's RRG allocation. If your group does not have an allocation, your job will run under your group's RAS allocation (previously called `default' allocation). Note that groups with an allocation cannot run under a default allocation.
- For users whose group has an allocation, the maximum walltime is 24 hours. For those without an allocation, the maximum walltime is 12 hours.
Scheduling hardware details
We now present the details of how to write a job script, and some extra commands which you might find useful.
SLURM nomenclature: jobs, nodes, tasks, cpus, cores, threads
SLURM has a somewhat different way of referring to things like MPI processes and thread tasks, as compared to our previous scheduler, MOAB. The SLURM nomenclature is reflected in the names of scheduler options (i.e., resource requests). SLURM strictly enforces those requests, so it is important to get this right.
term | meaning | SLURM term | related scheduler options |
---|---|---|---|
job | scheduled piece of work for which specific resources were requested. | job | sbatch, salloc |
node | basic computing component with several cores (40 for Niagara) that share memory | node | --nodes -N |
mpi process | one of a group of running programs using Message Passing Interface for parallel computing | task | --ntasks -n --ntasks-per-node |
core or physical cpu | A fully functional independent physical execution unit. | - | - |
logical cpu | An execution unit that the operating system can assign work to. Operating systems can be configured to overload physical cores with multiple logical cpus using hyperthreading. | cpu | --cpus-per-task |
thread | one of possibly multiple simultaneous execution paths within a program, which can share memory. | - | --cpus-per-task and OMP_NUM_THREADS |
hyperthread | a thread run in a collection of threads that is larger than the number of physical cores. | - | - |
Scheduling by Node
- On many systems that use SLURM, the scheduler will deduce from the job script specifications (the number of tasks and the number of cpus-per-node) what resources should be allocated. On Niagara, this is a bit different.
- All job resource requests on Niagara are scheduled as a multiple of nodes.
- The nodes that your jobs run on are exclusively yours.
- No other users are running anything on them.
- You can ssh into them, while your job is running, to see how things are going.
- Whatever you request of the scheduler, your request will always be translated into a multiple of nodes allocated to your job.
- Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes. Each node has about 202GB of RAM available.
- You should try to use all the cores on the nodes allocated to your job. Since there are 40 cores per node, your job should use N x 40 cores. If this is not the case, we will contact you to help you optimize your workflow. Again, users who have serial jobs should consult the serial jobs page.
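The node-based accounting above can be checked with a few lines of shell before submitting. This is only a sketch: node_usage is a hypothetical helper, not a SciNet or Slurm tool, and the 40 cores per node figure is the only number taken from this page.

```shell
#!/bin/bash
# Hypothetical helper (not a SciNet or Slurm command): report how many whole
# 40-core Niagara nodes a given number of cores occupies, and how many cores
# on those nodes would sit idle.
CORES_PER_NODE=40

node_usage() {
    local cores=$1
    # Round up to whole nodes, because Niagara allocates whole nodes only.
    local nodes=$(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
    local idle=$(( nodes * CORES_PER_NODE - cores ))
    echo "nodes=${nodes} idle_cores=${idle}"
}

node_usage 80     # fills two nodes exactly
node_usage 100    # spills onto a third node and leaves 20 cores idle
```

A non-zero idle_cores value suggests that the job should be resized, or that serial work should be bundled, so that every allocated node is fully used.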
Hyperthreading: Logical CPUs vs. cores
Hyperthreading, a technology that leverages more of the physical hardware by pretending there are twice as many logical cores as real cores, is enabled on Niagara. The operating system and scheduler see 80 logical CPUs.
Using 80 logical CPUs versus 40 real cores typically gives about a 5-10% speedup, depending on your application (your mileage may vary).
Because Niagara is scheduled by node, hyperthreading is actually fairly easy to use:
- Ask for a certain number of nodes, N, for your job.
- You know that you get 40 x N cores, so you will use (at least) a total of 40 x N MPI processes or threads (mpirun, srun, and the OS will automatically spread these over the real cores).
- But you should also test if running 80 x N MPI processes or threads gives you any speedup.
- Regardless, your usage will be counted as 40 x N x (walltime in years).
Many applications which are communication-heavy can benefit from the use of hyperthreading.
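As a worked example of the accounting rule above (40 x N x walltime in years), the following sketch converts a node count and a walltime in hours into core-years. The function name core_years and the 365-day year are assumptions made for illustration; the scheduler's own bookkeeping may round differently.

```shell
#!/bin/bash
# Hypothetical helper: usage in core-years = 40 x nodes x (walltime in years).
# Assumes a 365-day year; awk is used because bash has no floating point.
core_years() {
    local nodes=$1 hours=$2
    awk -v n="$nodes" -v h="$hours" \
        'BEGIN { printf "%.4f\n", 40 * n * h / (24 * 365) }'
}

core_years 2 24    # two nodes for one day
```

Note that the result does not depend on whether the job ran 40 or 80 processes per node, since usage is counted per physical core.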
Limits
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a Resources for Research Groups (RRG) allocation or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.
Usage | Partition | Running jobs | Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime |
---|---|---|---|---|---|---|---|
Compute jobs with an allocation | compute | 50 | 1000 | 1 node (40 cores) | 1000 nodes (40000 cores) | 15 minutes | 24 hours |
Compute jobs without allocation ("default") | compute | 50 | 200 | 1 node (40 cores) | 20 nodes (800 cores) | 15 minutes | 12 hours |
Testing or troubleshooting | debug | 1 | 1 | 1 node (40 cores) | 4 nodes (160 cores) | N/A | 1 hour |
Archiving or retrieving data in HPSS | archivelong | 2 per user (max 5 total) | 10 per user | N/A | N/A | 15 minutes | 72 hours |
Inspecting archived data, small archival actions in HPSS | archiveshort | 2 per user | 10 per user | N/A | N/A | 15 minutes | 1 hour |
Within these limits, jobs will still have to wait in the queue. The waiting time depends on many factors such as the allocation amount, how much allocation was used in the recent past, the number of nodes and the walltime, and how many other jobs are waiting in the queue.
SLURM Accounts
To be able to prioritise jobs based on groups and allocations, the SLURM scheduler uses the concept of accounts. Each group that has a Resources for Research Groups (RRG) or Research Platforms and Portals (RPP) allocation (awarded through an annual competition by Compute Canada) has an account that starts with rrg- or rpp-. SLURM assigns a 'fairshare' priority to these accounts based on the size of the award in core-years. Groups without an RRG or RPP can use Niagara using a so-called Rapid Access Service (RAS), and have an account that starts with def-.
On Niagara, most users will only ever use one account, and those users do not need to specify the account to SLURM. However, users that are part of collaborations may be able to use multiple accounts, i.e., that of their sponsor and that of their collaborator, but this means that they need to select the right account when running jobs.
To select the account, just add
#SBATCH -A [account]
to the job script, or pass the -A [account] option to salloc or debugjob.
To see which accounts you have access to, or what their names are, use the command
sshare -U
Passing Variables to submission scripts
It is possible to pass values through environment variables into your SLURM submission scripts. For doing so with already defined variables in your shell, just add the following directive in the submission script,
#SBATCH --export=ALL
and you will have access to any predefined environment variable.
A better way is to specify explicitly which variables you want to pass into the submission script,
sbatch --export=i=15,j='test' jobscript.sbatch
You can even set the job name and output files using environment variables, e.g.
i="simulation"
j=14
sbatch --job-name=$i.$j.run --output=$i.$j.out --export=i=$i,j=$j jobscript.sbatch
(The latter only works on the command line; you cannot use environment variables in #SBATCH lines in the job script.)
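On the receiving side, the job script simply reads the exported variables like any other environment variable. The sketch below assumes the variable names i and j from the sbatch examples above; the function describe_case and the fallback values are hypothetical and only there so the script also behaves sensibly when a variable was not exported.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=0:15:00

# Sketch of a job script body that consumes variables passed with
# "sbatch --export=i=...,j=...". The defaults after ":-" are assumptions.
describe_case() {
    echo "Running case i=${i:-0}, j=${j:-none}"
}

describe_case
```

Submitted with sbatch --export=i=15,j='test' jobscript.sbatch, the job would report i=15 and j=test; without the variables it falls back to the defaults.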
Command line arguments:
Command line arguments can also be used in the same way as command line arguments for shell scripts. All command line arguments given to sbatch after the job script name will be passed to the job script. In fact, SLURM will not look at any of these arguments, so you must place all sbatch arguments before the script name, e.g.:
sbatch -p debug jobscript.sbatch FirstArgument SecondArgument ...
In this example, -p debug is interpreted by SLURM, while in your submission script you can access FirstArgument, SecondArgument, etc., by referring to $1, $2, ...
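Inside the job script, the positional parameters behave exactly as in any bash script. The following minimal sketch shows one way to read them; the function name describe_args and the placeholder text for missing arguments are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of a job script body that reads its positional arguments.
# describe_args is a hypothetical helper; "<none>" marks a missing argument.
describe_args() {
    echo "got $# argument(s): first=${1:-<none>} second=${2:-<none>}"
}

# "$@" forwards whatever followed the script name on the sbatch command line.
describe_args "$@"
```

With the sbatch command shown above, this would report two arguments, FirstArgument and SecondArgument.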
Email Notification
Email notification works, but you need to add the email address and the type of notification you want to receive to your submission script, e.g.
#SBATCH --mail-user=YOUR.email.ADDRESS
#SBATCH --mail-type=ALL
The sbatch man page (type man sbatch on Niagara) explains all possible mail-types.
Example submission scripts
Here we present some example submission scripts.
Example submission script (MPI)
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks=320
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"
Submit this script with the command:
nia-login07:~$ sbatch mpi_job.sh
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name mpi_job).
- In this case, SLURM looks for 8 nodes with 40 cores each on which to run 320 tasks, for 1 hour.
- Note that the mpirun flag "--ppn" (processors per node) is ignored.
- Once it finds such nodes, it runs the script:
  - Changes to the submission directory;
  - Loads modules;
  - Runs the mpi_example application.
- To use hyperthreading, just change --ntasks=320 to --ntasks=640, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
Example submission script (OpenMP)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=1:00:00
#SBATCH --job-name openmp_job
#SBATCH --output=openmp_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example".
Submit this script with the command:
nia-login07:~$ sbatch openmp_job.sh
- The first line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name openmp_job).
- In this case, SLURM looks for one node with 40 cores to be run inside one task, for 1 hour.
- Once it finds such a node, it runs the script:
  - Changes to the submission directory;
  - Loads modules;
  - Sets an environment variable;
  - Runs the openmp_example application.
- To use hyperthreading, just change --cpus-per-task=40 to --cpus-per-task=80.
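Because the thread count in the OpenMP script is taken from SLURM_CPUS_PER_TASK, the same script can be tested outside the queue if that variable is given a fallback. The sketch below is one possible way to do this; the helper name omp_threads and the fallback of a single thread are assumptions, not part of the SciNet setup.

```shell
#!/bin/bash
# Hypothetical helper: use the value Slurm provides when running as a job,
# and fall back to a single thread (an assumed default) otherwise.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS=$(omp_threads)
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

With --cpus-per-task=80 in the job script, the same lines would select 80 threads without any further change.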
Queues
Niagara
batch
The batch queue is the default queue on Niagara, allowing users access to all the resources for jobs between 15 minutes and 24 hours, for users whose groups have an allocation, and 12 hours for users with no allocation. If a specific queue is not specified, using the -p flag, then a job is automatically submitted to the batch queue. As a general rule, most jobs will run in the batch queue.
To submit a job to the queue
For example, to request two nodes anywhere on Niagara, use
#SBATCH --nodes=2 --time=1:00:00
For two nodes using DDR, use
#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00
To get two nodes using QDR, instead, you would use
#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00
debug
A debug queue has been set up primarily for code developers to quickly test and evaluate their codes and configurations without having to wait in the batch queue. There are 10 nodes currently reserved for the debug queue. It has quite restrictive limits to promote high turnover and availability: a user can use 2 nodes (16 cores) for up to 2 hours, or up to a maximum of 8 nodes (64 cores) for half an hour, and can have only one job in the debug queue at a time. There is no minimum time limit on this queue.
$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -q debug -I
Job Info
To see all jobs queued on a system use
$ showq
Three sections are shown: running, idle, and blocked. Idle jobs are commonly referred to as queued jobs, as they meet all the requirements but are waiting for available resources. Blocked jobs are caused either by improper resource requests or, more commonly, by exceeding a user's or group's allowable resources. For example, if you are allowed to submit 10 jobs and you submit 20, the first 10 jobs will be submitted properly and either run right away or be queued; the other 10 jobs will be blocked and won't be submitted to the queue until one of the first 10 finishes.
Available Resources
Determining when your job will run can be tricky as it involves a combination of queue type, node type, system reservations, and job priority. The following commands are provided to help you figure out what resources are currently available, however they may not tell you exactly when your job will run for the aforementioned reasons.
GPC
To show how many qdr nodes are currently free, use the show back fill command
$ showbf -f qdr
To show how many ddr nodes are free, use
$ showbf -f ddr
TCS
To show how many TCS nodes are free, use
$ showbf -c verylong
For example checking for a qdr job
$ showbf -f qdr
Partition  Tasks  Nodes      Duration   StartOffset       StartDate
---------  -----  -----  ------------  ------------  --------------
ALL        14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL          256     30      INFINITY      00:00:00  00:23:37_09/24
shows that for jobs under 7:36:23 you can use 1839 nodes, but if you submit a job over that time only 30 will be available. In this case this is due to a large reservation made by SciNet staff, but from a user's point of view, showbf tells you very simply what is available and for how long. In this case, a user may wish to set #PBS -l walltime=7:30:00 in their script, or add -l walltime=7:30:00 to their qsub command, in order to ensure that the jobs backfill the reserved nodes.
NOTE: showbf shows currently available nodes, however just because nodes are available doesn't mean that your job will start right away. Job priority, system reservations along with dedicated nodes, such as those for the debug queue, will alter when jobs run so even if enough nodes appear "free", it doesn't mean your job will actually run right away.
Job Submission
Interactive
On the GPC an interactive queue session can be requested using the following
$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I
If you experience longer than usual delays when requesting an interactive session, try specifying the "debug" queue:
$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I -q debug
Non-interactive (Batch)
For a non-interactive job submission you require a submission script formatted for the appropriate resource manager. Examples are provided for the GPC and TCS.
Job Status
$ qstat jobid
To see the status of all your jobs, use
showq -u username
or
qstat -u username
If your job appears to be blocked, you can try the following command
$ checkjob jobid
which gives more verbose and often a bit confusing output.
Cancel a Job
$ canceljob jobid
To cancel all of your jobs, try this command:
$ canceljob `showq | grep yourusername | cut -c -8`
or using the torque commands (these tend to work faster for larger number of jobs)
$ qdel `qstat | grep yourusername | cut -c -8`
Accounting
For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project (RAP) identifier (RAPI) from Compute Canada Database (CCDB) is set up in order to access the allocated resources. Please use the following instructions to run your job using your special allocation. This is necessary both for accounting purposes as well as to assign the appropriate priority to your jobs.
Each job run on the system will have a default RAP associated with it. Most users already have their default RAP properly set. However, if you have more than one allocation (different RAPs), you may need/want to change your default RAP in order to charge your jobs to a particular RAP.
Changing your default RAP
- Go to the portal, login with your SciNet username and password.
- Click on "Change SciNet default RAP" and change your default RAP.
Specifying the RAP for GPC
Alternatively, you may want to assign a RAP for each particular job you run. There are two ways to specify an account for Moab/Torque: From the command line or inside the batch submission script.
Command line
Use the '-A RAPI' flag when you submit your job using qsub. Note that the command line option will override the submission script if an account is specified on both the submission script and the command line. "RAPI" is the RAP Identifier, e.g. abc-123-de.
Submission Script
Add a line in your submit script as follows:
#PBS -A RAPI
Please replace "RAPI" with your RAP Identifier.
Specifying the RAP for TCS
Add a line in your submit script as follows:
# @ account_no = RAPI
Please replace "RAPI" with your RAP Identifier.
User Stats
Show current usage stats for a $USER
$ showstats -u $USER
Reservations
$ showres
Standard users can only see their own reservations, not those of other users or of the system. To determine what is available, a user can use "showbf", which shows what resources are available and for how long, taking into account running jobs and all the reservations. Refer to the Available Resources section of this page for more details.
Job Dependencies
Sometimes you may want one job not to start until another job finishes, however you would like to submit them both at the same time. This can be done using job dependencies on both the GPC and TCS, however the commands are different due to the underlying resource managers being different.
GPC
Use the -W flag with the following syntax in your submission script to have this job not start until the job with jobid has successfully finished
-W depend=afterok:jobid
This functionality also allows you to add dependencies on several jobs, e.g.
-W depend=afterok:jobid1:jobid2:...:jobidN
in which case the job will not start until all the jobs jobid1, jobid2, ..., jobidN have successfully finished. Note that the length of the string "afterok:jobid1:jobid2...:jobidN" cannot exceed 1024 characters.
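When the list of job ids is generated by a script, the dependency string and its 1024-character limit can be handled in one place. The sketch below is an illustration only: build_depend is a hypothetical helper and the job ids are made up.

```shell
#!/bin/bash
# Hypothetical helper: join job ids into "afterok:id1:id2:...:idN" and refuse
# strings longer than the 1024-character limit mentioned above.
build_depend() {
    local IFS=:                 # "$*" joins the arguments with ":"
    local spec="afterok:$*"
    if [ "${#spec}" -gt 1024 ]; then
        echo "dependency string too long (${#spec} > 1024 characters)" >&2
        return 1
    fi
    echo "$spec"
}

build_depend 1001 1002 1003     # made-up job ids
```

The result could then be passed on the command line, for example qsub -W depend=$(build_depend 1001 1002 1003) jobscript.sh.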
More detailed syntax, dependency options and examples can be found [here].
TCS
Loadleveler does job dependencies using what they call steps. See the TCS Quickstart guide for an example.
Adjusting Job Priority
The ability to adjust job priorities downwards can also be of use to adjust relative priorities of jobs between users who are running jobs under the same allocation (e.g., a default or NRAC allocation of the same PI). Priorities are determined by how much of the time of that allocation has currently been used, and all users using that account will have identical priorities. This mechanism allows users to voluntarily reduce their priority to allow other users of the same allocation to run ahead of them.
In principle, by adjusting a job's priority downwards, you could reduce your job's priority to the point that someone else's job entirely could go ahead of yours. In practice, however, this is extremely unlikely. Users with NRAC allocations have priorities that are extremely large positive numbers that depend on their allocation and how much of it they have already used during the past fairshare window (2 weeks); it is very unlikely that two groups would have priorities that are within 10 or 100 or 1000 of each other.
Note that at the moment, we do not allow priorities to go negative; they are integers that can go no lower than 1. (This may change in the future.) That means that users of accounts that have already used their full allocation during the current fairshare period (e.g., over the past two weeks), and so whose priority would normally be negative but is capped at 1, cannot lower their priority any further. Similarly, users with a 'default' allocation have priority 1, and cannot lower their priorities any further.
GPC
Moab allows users to adjust their jobs' priority moderately downwards, with the -p flag; that is, on a qsub line
$ qsub ... -p -10 JOBID
or in a script
...
#PBS -p -10
...
The number used (-10 in the examples above) can be any negative number down to -1024.
The ability to adjust job priorities downwards can be useful when you are running a number of jobs and want some to enter the queue at higher priorities than others. Note that if you absolutely require some jobs to start before others, you could use job dependencies instead.
For a job that is currently queued, one can adjust its priority with
$ qalter -p -10 JOBID
Suspending a Running Job
Separate from, and in addition to, the ability to place a hold on a queued job, you may want to suspend a running job. For example, you may want to test the timing of events in a weakly coupled parallel environment.
GPC
To suspend a job:
qsig -s STOP <jobid>
and to start it again:
qsig -s CONT <jobid>.
Scripts are suspendable by default, so you don't need to add any signal handling for this to work. As far as we can tell, the result is identical to using fg and ctrl-Z (or kill -STOP <PID>) in an interactive run.
More about using (and trapping) signals can be found on the Using Signals page.
QDR Network Switch Affinity
The QDR network is globally 5:1 oversubscribed, but on switch it has full 1:1 cross-section whereas the DDR is completely 1:1 non-blocking. When a job is submitted to the GPC QDR nodes, the queuing system tries to fulfill the requirements with nodes on the same switch to improve network performance. If not enough nodes are available on the same switch to satisfy the job request, the queue will then use any available nodes. This behavior can be changed by using the submission flag " -l nodesetisoptional=false " which forces the queuing system to only run this job on the same switch, thus the job will stay queued until enough nodes on one switch are available to satisfy this request. Note that the maximum number of nodes on one switch is 30, so a request of greater than 30 nodes with this flag will never run.
qsub -l nodes=4:ppn=8,walltime=10:00,nodesetisoptional=false
Multiple Job Submissions
If you are doing batch processing of a number of similar jobs on the GPC, torque has a feature called job arrays that can be used to simplify this process. By using the "-t 0-N" option on the command line during job submission, or putting it in the job script file as #PBS -t 0-N, torque will expand your single job submission into N+1 jobs and set the environment variable PBS_ARRAYID to that job's specific number, i.e. 0-N, for each job. This reduces the number of calls to qsub, and can allow the user to have far fewer submission scripts. Job arrays also have the benefit of batching groups of jobs, allowing commands like qalter, qdel, qhold to work on all or a subset of the job array jobs with one command, instead of having to run the command for each job.
In the following example, 10 jobs are submitted using a single command
qsub -t 0-9 jobscript.sh
and the submission script then modifies the job based on the PBS_ARRAYID.
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=10:00:00
#PBS -N array_jobs

cd ${PBS_O_WORKDIR}
mkdir job.${PBS_ARRAYID}
cd job.${PBS_ARRAYID}
echo "Running job ${PBS_ARRAYID}"
mpirun -np 8 ./mycode >& array_job.${PBS_ARRAYID}.out
The JOBID and the job name both get the ARRAYID appended after a hyphen, i.e. JOBID-ARRAYID. If, for example, you wanted to cancel all the jobs in a job array you would use "qdel JOBID", whereas if you wanted to cancel just one of the jobs you would use "qdel JOBID-ARRAYID".
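A common way to make each array job do different work is to use PBS_ARRAYID as an index into a list of inputs. The sketch below illustrates the idea under stated assumptions: the parameter values, the array name params and the helper param_for_index are all hypothetical, and the index falls back to 0 so the lines can also be tried outside the queue.

```shell
#!/bin/bash
# Hypothetical parameter sweep: one value per array index (0-3), matching a
# submission such as "qsub -t 0-3 jobscript.sh".
params=(0.1 0.2 0.5 1.0)

# Print the parameter that belongs to a given array index.
param_for_index() {
    echo "${params[$1]}"
}

# Torque sets PBS_ARRAYID inside each array job; default to 0 elsewhere.
index=${PBS_ARRAYID:-0}
echo "array index ${index} uses parameter $(param_for_index "$index")"
```

Keeping the list inside the script means a single submission script covers the whole sweep, which is the main point of job arrays.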
See here and here for full details.
Job Tools and Scripts
The following commands and scripts are useful for monitoring and managing submissions and jobs. We have discussed them in a recent TechTalk. More details can be found here.
In addition, the following scripts are available:
diskUsage
informs about the user and group file system usage.
quota
offers a shorter version for the user.
qsum
similar to showq, summarizes by user.
jobperf jobID
informs about the performance per-node of a given job.
jobError <jobID | jobNAME>
displays in real time the error output of a given job.
jobOutput <jobID | jobNAME>
displays in real time the standard output of a given job.
jobcd <jobID | jobNAME>
allows users to quickly move into the working directory of a given job.
jobscript <jobID | jobNAME>
displays the submission script used when submitting a given job.
jobssh <jobID | jobNAME>
allows users to connect to the head-node of a given job.
jobtop <jobID | jobNAME>
allows users to "top" on the head-node of a given job.
jobtree [user]
displays the jobs tree of dependencies for a given user
jobdep <jobID>
displays the dependencies of a given job.
jobperf <jobID>
displays current performance statistics of a given job (must be running).
Monitoring Jobs
Checking the memory usage from jobs
On many occasions it can be really useful to take a look at how much memory your job is using while it is running. There are a couple of ways to do so:
1) Use some of the command line utilities we have developed, e.g. jobperf or jobtop, which let you check the job performance and the head node's utilization, respectively.
2) ssh into the nodes where your job is running and check memory usage and system stats right there, for instance with the 'top' or 'free' commands.
Also, it is always a good idea, and strongly encouraged, to inspect the standard output log and error log generated for your job submissions. These files are named JobName.{o|e}jobIdNumber, where JobName is the name you gave to the job (via the '-N' PBS flag) and jobIdNumber is the id number of the job. These files are saved in the working directory after the job is finished, but they can also be accessed in real time using the jobError and jobOutput command line utilities.
Other related topics to memory usage:
Using Ram Disk
Different Memory Configuration nodes
Monitoring Jobs in the Queue