Slurm

From SciNet Users Documentation
Revision as of 17:16, 2 August 2018 by Ejspence (talk | contribs)

The queueing system used at SciNet is based around the Slurm Workload Manager. This "scheduler", Slurm, determines which jobs will be run on which compute nodes, and when. This page outlines how to submit jobs, how to interact with the scheduler, and some of the most common Slurm commands.

Some common questions about the queuing system can be found on the FAQ as well.

Submitting jobs

You submit jobs from a Niagara login node. This is done by passing a script to the sbatch command:

nia-login07:~$ sbatch jobscript.sh

This puts the job, described by the job script, into the queue. The scheduler will run the job on the compute nodes in due course. A typical submission script is as follows.

#!/bin/bash 
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"

Some notes about this example:

  • The first line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name mpi_job).
  • In this case, SLURM looks for 2 nodes, each with 40 cores, on which to run 80 tasks, for 1 hour.
  • Note that the mpirun flag "--ppn" (processors per node) is ignored. Slurm takes care of this detail.
  • Once the scheduler finds a spot to run the job, it runs the script:
    • It changes to the submission directory;
    • Loads modules;
    • Runs the mpi_example application.
  • To use hyperthreading, just change --ntasks=80 to --ntasks=160, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).

To create a job script appropriate for your work, modify the example above so that it requests the resources and runs the commands you need.

Things to remember

There are some things to always bear in mind when crafting your submission script:

  • Scheduling is by node, so in multiples of 40 cores. You are expected to use all 40 cores! If you are running serial jobs, and need assistance bundling your work into multiples of 40, please see the serial jobs page.
  • Jobs must write to your scratch or project directory (home is read-only on compute nodes).
  • Compute nodes have no internet access. Download data you need before submitting your job.
  • Jobs will run under your group's RRG allocation. If your group does not have an allocation, your job will run under your group's RAS allocation (previously called `default' allocation). Note that groups with an allocation cannot run under a default allocation.
  • For users whose group has an allocation, the maximum walltime is 24 hours. For those without an allocation, the maximum walltime is 12 hours.
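
As a minimal sketch of what "bundling" serial work onto a full node can look like, the script below starts 40 independent tasks in the background and waits for all of them. The function run_task and the temporary output directory are hypothetical placeholders; the serial jobs page describes the approaches recommended for real work.

```shell
#!/bin/bash
# Sketch: run 40 independent serial tasks at once on one 40-core node.
# run_task is a hypothetical placeholder for your own serial program.
run_task() {
    echo "task $1 done"
}

outdir=$(mktemp -d)     # stand-in for a directory under your scratch space
for i in $(seq 1 40); do
    run_task "$i" > "$outdir/task_$i.log" &    # start each task in the background
done
wait                    # do not let the job end before every task has finished

finished=$(cat "$outdir"/task_*.log | grep -c "done")
echo "$finished tasks finished"
```

Without the final wait, the job script would exit, and the job would end, while the background tasks were still running.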


Scheduling details

We now present the details of how to write a job script, and extra commands which you might find useful.

SLURM nomenclature: jobs, nodes, tasks, cpus, cores, threads

SLURM has a somewhat different way of referring to things like MPI processes and thread tasks, as compared to our previous scheduler, MOAB. The SLURM nomenclature is reflected in the names of scheduler options (i.e., resource requests). SLURM strictly enforces those requests, so it is important to get this right.

term | meaning | SLURM term | related scheduler options
job | scheduled piece of work for which specific resources were requested. | job | sbatch, salloc
node | basic computing component with several cores (40 for Niagara) that share memory | node | --nodes -N
mpi process | one of a group of running programs using Message Passing Interface for parallel computing | task | --ntasks -n --ntasks-per-node
core or physical cpu | A fully functional independent physical execution unit. | - | -
logical cpu | An execution unit that the operating system can assign work to. Operating systems can be configured to overload physical cores with multiple logical cpus using hyperthreading. | cpu | --cpus-per-task
thread | one of possibly multiple simultaneous execution paths within a program, which can share memory. | - | --cpus-per-task and OMP_NUM_THREADS
hyperthread | a thread run in a collection of threads that is larger than the number of physical cores. | - | -

Scheduling by Node

  • On many systems that use SLURM, the scheduler will deduce from the specifications of the number of tasks and the number of cpus-per-node, what resources should be allocated. On Niagara, this is a bit different.
  • All job resource requests on Niagara are scheduled as a multiple of nodes.

  • The nodes that your jobs run on are exclusively yours.
    • No other users are running anything on them.
    • You can ssh into them to see how things are going.
  • Whatever your requests to the scheduler, it will always be translated into a multiple of nodes allocated to your job.

  • Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes.

  • You should try to use all the cores on the nodes allocated to your job. Since there are 40 cores per node, your job should use N x 40 cores. If this is not the case, we will contact you to help you optimize your workflow.
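
To make the node-based accounting concrete, here is a small sketch that computes how many whole nodes a given number of tasks occupies and how many cores would sit idle. The function name nodes_needed is made up for this illustration; the only fact taken from the text is the 40 cores per node.

```shell
#!/bin/bash
# Sketch: how many whole nodes does a given number of tasks occupy?
CORES_PER_NODE=40    # Niagara nodes have 40 physical cores

nodes_needed() {
    local ntasks=$1
    # Integer ceiling division: always round up to a whole number of nodes.
    echo $(( (ntasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

for ntasks in 1 40 41 80 320; do
    n=$(nodes_needed "$ntasks")
    echo "ntasks=$ntasks -> nodes=$n, idle cores=$(( n * CORES_PER_NODE - ntasks ))"
done
```

A request for 41 tasks, for instance, occupies two nodes and leaves 39 cores idle, which is the situation the last point above asks you to avoid.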

Hyperthreading: Logical CPUs vs. cores

Hyperthreading, a technology that leverages more of the physical hardware by pretending there are twice as many logical cpus as real cores, is enabled on Niagara. So the OS and scheduler see 80 logical cpus per node.

Using 80 logical cpus vs. 40 real cores typically gives about a 5-10% speedup (Your Mileage May Vary).

Because Niagara is scheduled by node, hyperthreading is actually fairly easy to use:

  • Ask for a certain number of nodes N for your jobs.
  • You know that you get 40xN cores, so you will use (at least) a total of 40xN mpi processes or threads. (mpirun, srun, and the OS will automatically spread these over the real cores.)
  • But you should also test if running 80xN mpi processes or threads gives you any speedup.
  • Regardless, your usage will be counted as 40xNx(walltime in years).
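
The usage rule in the last point can be worked out with a few lines of shell. This is only a sketch: the function name core_years is invented here, and the 365-day year is an assumption, so the scheduler's own accounting may differ slightly.

```shell
#!/bin/bash
# Sketch: usage charged for a job, in core-years, following the rule
# 40 x N x (walltime in years). Assumes a 365-day year.
core_years() {
    local nodes=$1 hours=$2
    awk -v n="$nodes" -v h="$hours" \
        'BEGIN { printf "%.4f\n", 40 * n * h / (24 * 365) }'
}

core_years 2 24    # 2 nodes for 24 hours; prints 0.2192
```

The number of processes or threads does not appear in the formula: running 80 per node instead of 40 changes the speed of the job, not the rate at which usage is counted.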

Limits

There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a Resources for Research Groups allocation or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.

Usage | Partition | Running jobs | Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime
Compute jobs with an allocation | compute | 50 | 1000 | 1 node (40 cores) | 1000 nodes (40000 cores) | 15 minutes | 24 hours
Compute jobs without allocation ("default") | compute | 50 | 200 | 1 node (40 cores) | 20 nodes (800 cores) | 15 minutes | 12 hours
Testing or troubleshooting | debug | 1 | 1 | 1 node (40 cores) | 4 nodes (160 cores) | N/A | 1 hour
Archiving or retrieving data in HPSS | archivelong | 2 per user (max 5 total) | 10 per user | N/A | N/A | 15 minutes | 72 hours
Inspecting archived data, small archival actions in HPSS | archiveshort | 2 per user | 10 per user | N/A | N/A | 15 minutes | 1 hour

Within these limits, jobs will still have to wait in the queue. The waiting time depends on many factors such as the allocation amount, how much allocation was used in the recent past, the number of nodes and the walltime, and how many other jobs are waiting in the queue.
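
Before submitting, it can help to check a request against the size and walltime limits listed above. The following is a sketch only: the function check_request is hypothetical, it covers just the compute and debug partitions, and it treats the debug partition's "N/A" minimum walltime as zero.

```shell
#!/bin/bash
# Sketch: check a request against the node and walltime limits listed above.
# Usage: check_request <partition> <has allocation: yes|no> <nodes> <walltime in minutes>
check_request() {
    local partition=$1 alloc=$2 nodes=$3 minutes=$4
    local max_nodes max_minutes min_minutes=15
    case "$partition" in
        compute)
            if [ "$alloc" = yes ]; then
                max_nodes=1000; max_minutes=$((24 * 60))
            else
                max_nodes=20; max_minutes=$((12 * 60))
            fi ;;
        debug)
            # No minimum walltime is listed for debug, so use 0 here.
            max_nodes=4; max_minutes=60; min_minutes=0 ;;
        *)
            echo "unknown partition: $partition"; return 2 ;;
    esac
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt "$max_nodes" ]; then
        echo "rejected: $nodes nodes is outside 1..$max_nodes"; return 1
    fi
    if [ "$minutes" -lt "$min_minutes" ] || [ "$minutes" -gt "$max_minutes" ]; then
        echo "rejected: $minutes minutes is outside $min_minutes..$max_minutes"; return 1
    fi
    echo "ok"
}

check_request compute yes 8 60
check_request compute no 8 1440 || echo "fix the request before calling sbatch"
```

The second call is rejected because a 24-hour walltime exceeds the 12-hour maximum for groups without an allocation.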

SLURM Accounts

To be able to prioritise jobs based on groups and allocations, the SLURM scheduler uses the concept of accounts. Each group that has a Resources for Research Groups (RRG) or Research Platforms and Portals (RPP) allocation (awarded through an annual competition by Compute Canada) has an account that starts with rrg- or rpp-. SLURM assigns a 'fairshare' priority to these accounts based on the size of the award in core-years. Groups without an RRG or RPP can use Niagara using a so-called Rapid Access Service (RAS), and have an account that starts with def-.

On Niagara, most users will only ever use one account, and those users do not need to specify the account to SLURM. However, users that are part of collaborations may be able to use multiple accounts, i.e., that of their sponsor and that of their collaborator, which means that they need to select the right account when running jobs.

To select the account, just add

   #SBATCH -A [account]

to the job script, or pass -A [account] to salloc or debugjob.

To see which accounts you have access to, or what their names are, use the command

   sshare -U

Passing Variables to submission scripts

It is possible to pass values through environment variables into your SLURM submission scripts. To do so with variables already defined in your shell, just add the following directive to the submission script,

#SBATCH --export=ALL

and you will have access to any predefined environment variable.

A better way is to specify explicitly which variables you want to pass into the submission script,

sbatch --export=i=15,j='test' jobscript.sbatch

You can even set the job name and output files using environment variables, e.g.,

i="simulation"
j=14
sbatch --job-name=$i.$j.run --output=$i.$j.out --export=i=$i,j=$j jobscript.sbatch

(The latter only works on the command line; you cannot use environment variables in #SBATCH lines in the job script.)
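
On the receiving side, the job script simply reads the exported variables. The sketch below shows one way to do that; the variable names i and j come from the examples above, while the default values and the output file name are assumptions added so that the script also runs when nothing was exported.

```shell
#!/bin/bash
# Sketch of the body of a job script that uses variables passed with
#   sbatch --export=i=15,j='test' jobscript.sbatch
# The defaults below are assumptions, used only when i or j was not exported.
i=${i:-0}
j=${j:-default}

outfile="result_${j}_${i}.txt"    # hypothetical output file name
echo "running case j=$j with i=$i, writing to $outfile"
```

Giving each variable a default makes a forgotten --export show up as an obviously wrong file name rather than as an empty string inside a path.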

Command line arguments:

Command line arguments can also be used, in the same way as command line arguments for shell scripts. All command line arguments given to sbatch after the job script name will be passed to the job script. In fact, SLURM will not look at any of these arguments, so you must place all sbatch arguments before the script name, e.g.:

sbatch  -p debug  jobscript.sbatch  FirstArgument SecondArgument ...

In this example, -p debug is interpreted by SLURM, while in your submission script you can access FirstArgument, SecondArgument, etc., by referring to $1, $2, ....
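
Inside the job script, these arguments behave like those of any bash script. A minimal sketch, in which the argument meanings (an input file and a repetition count) are invented for illustration and the logic is wrapped in a function only so that it can be tried outside the queue:

```shell
#!/bin/bash
# Sketch of a job script that reads its own command line arguments, as in
#   sbatch -p debug jobscript.sbatch input.dat 4
# The meaning of the two arguments is hypothetical.
main() {
    if [ "$#" -lt 2 ]; then
        echo "usage: jobscript.sbatch <input file> <repetitions>" >&2
        return 1
    fi
    local input=$1 reps=$2
    echo "processing $input, $reps times"
}

# In a real job script this line would be: main "$@"
main input.dat 4
```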

Email Notification

Email notification works, but you need to add the email address and the type of notification you want to receive to your submission script, e.g.,

   #SBATCH --mail-user=YOUR.email.ADDRESS
   #SBATCH --mail-type=ALL

The sbatch man page (type man sbatch on Niagara) explains all possible mail-types.

Example submission scripts

Here we present some example submission scripts.

Example submission script (MPI)

#!/bin/bash 
#SBATCH --nodes=8
#SBATCH --ntasks=320
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"

Submit this script with the command:

   nia-login07:~$ sbatch mpi_job.sh

  • The first line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name mpi_job).
  • In this case, SLURM looks for 8 nodes, each with 40 cores, on which to run 320 tasks, for 1 hour.
  • Note that the mpirun flag "--ppn" (processors per node) is ignored.
  • Once the scheduler finds such nodes, it runs the script:
    • It changes to the submission directory;
    • Loads modules;
    • Runs the mpi_example application.
  • To use hyperthreading, just change --ntasks=320 to --ntasks=640, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).

Example submission script (OpenMP)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=1:00:00
#SBATCH --job-name openmp_job
#SBATCH --output=openmp_output_%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example".

Submit this script with the command:

   nia-login07:~$ sbatch openmp_job.sh

  • The first line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name openmp_job).
  • In this case, SLURM looks for one node and assigns all 40 of its cores to a single task, for 1 hour.
  • Once the scheduler finds such a node, it runs the script:
    • It changes to the submission directory;
    • Loads modules;
    • Sets an environment variable;
    • Runs the openmp_example application.
  • To use hyperthreading, just change --cpus-per-task=40 to --cpus-per-task=80.






Queues

Niagara

batch

The batch queue is the default queue on Niagara, allowing users access to all the resources for jobs between 15 minutes and 24 hours, for users whose groups have an allocation, and 12 hours for users with no allocation. If a specific queue is not specified, using the -p flag, then a job is automatically submitted to the batch queue. As a general rule, most jobs will run in the batch queue.

To submit a job to the queue

For example, to request two nodes anywhere on Niagara, use

#PBS -l nodes=2:ppn=8,walltime=1:00:00

For two nodes using DDR, use

#PBS -l nodes=2:ddr:ppn=8,walltime=1:00:00

To get two nodes using QDR, instead, you would use

#PBS -l nodes=2:qdr:ppn=8,walltime=1:00:00

debug

A debug queue has been set up primarily for code developers to quickly test and evaluate their codes and configurations without having to wait in the batch queue. There are 10 nodes currently reserved for the debug queue. Its limits are quite restrictive, to promote high turnover and availability: a user can use 2 nodes (16 cores) for up to 2 hours, or up to a maximum of 8 nodes (64 cores) for half an hour, and can have only one job in the debug queue at a time. There is no minimum time limit on this queue.

$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -q debug -I


Job Info

To see all jobs queued on a system use

$ showq

Three sections are shown: running, idle, and blocked. Idle jobs are commonly referred to as queued jobs: they meet all the requirements, but are waiting for available resources. Blocked jobs are caused either by improper resource requests or, more commonly, by exceeding a user's or group's allowable resources. For example, if you are allowed to submit 10 jobs and you submit 20, the first 10 jobs will be submitted properly and either run right away or be queued, but the other 10 jobs will be blocked and will not be submitted to the queue until one of the first 10 finishes.

Available Resources

Determining when your job will run can be tricky, as it involves a combination of queue type, node type, system reservations, and job priority. The following commands are provided to help you figure out what resources are currently available; however, they may not tell you exactly when your job will run, for the aforementioned reasons.

GPC

To show how many qdr nodes are currently free, use the show back fill command

$ showbf -f qdr

To show how many ddr nodes are free, use

$ showbf -f ddr

TCS

To show how many TCS nodes are free, use

$ showbf -c verylong

For example, checking for a qdr job

$ showbf -f qdr
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL           14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL             256     30      INFINITY      00:00:00  00:23:37_09/24

shows that for jobs under 7:36:23 you can use 1839 nodes, but if you submit a job over that time only 30 will be available. In this case this is due to a large reservation made by SciNet staff, but from a user's point of view, showbf tells you very simply what is available and for how long. In this case, a user may wish to set #PBS -l walltime=7:30:00 in their script, or add -l walltime=7:30:00 to their qsub command, in order to ensure that the jobs backfill the reserved nodes.
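
Reading the backfill window off the table can also be scripted. The sketch below works on the sample output quoted above rather than on a live system; the function name backfill_window is invented, and it assumes that durations are given as h:mm:ss or INFINITY.

```shell
#!/bin/bash
# Sketch: find the longest backfill window that offers at least a given
# number of nodes. The sample is the showbf output quoted above; on a live
# system one would pipe "showbf -f qdr" into the function instead.
sample='Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL           14728   1839       7:36:23      00:00:00  00:23:37_09/24
ALL             256     30      INFINITY      00:00:00  00:23:37_09/24'

backfill_window() {
    # $1 = number of nodes wanted; reads showbf output on standard input.
    awk -v want="$1" '
        NR > 2 && $3 + 0 >= want {          # skip the two header lines
            if ($4 == "INFINITY") { best = "INFINITY"; unlimited = 1 }
            else if (!unlimited) {
                n = split($4, t, ":"); sec = 0
                for (k = 1; k <= n; k++) sec = sec * 60 + t[k]   # h:mm:ss to seconds
                if (sec > bestsec) { bestsec = sec; best = $4 }
            }
        }
        END { print (best == "" ? "none" : best) }'
}

printf '%s\n' "$sample" | backfill_window 100    # prints 7:36:23
```

A job asking for 100 nodes should therefore request a walltime below 7:36:23 if it is to backfill, whereas a job of 30 nodes or fewer is not constrained by the reservation.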

NOTE: showbf shows currently available nodes, however just because nodes are available doesn't mean that your job will start right away. Job priority, system reservations along with dedicated nodes, such as those for the debug queue, will alter when jobs run so even if enough nodes appear "free", it doesn't mean your job will actually run right away.

Job Submission

Interactive

On the GPC an interactive queue session can be requested using the following

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I

If you experience longer than usual delays when requesting an interactive session, try specifying the "debug" queue:

$ qsub -l nodes=2:ppn=8,walltime=1:00:00 -I -q debug

Non-interactive (Batch)

For a non-interactive job submission you require a submission script formatted for the appropriate resource manager. Examples are provided for the GPC and TCS.

Job Status

$ qstat jobid

To see the status of all your jobs, use

showq -u username

or

qstat -u username

If your job appears to be blocked, you can try the following command

$ checkjob jobid

which gives more verbose, and often somewhat confusing, output.

Cancel a Job

$ canceljob jobid

To cancel all of your jobs, try this command:

$ canceljob `showq | grep yourusername | cut -c -8` 

or using the torque commands (these tend to work faster for larger numbers of jobs)

$ qdel `qstat | grep yourusername | cut -c -8` 

Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project (RAP) identifier (RAPI) from Compute Canada Database (CCDB) is set up in order to access the allocated resources. Please use the following instructions to run your job using your special allocation. This is necessary both for accounting purposes as well as to assign the appropriate priority to your jobs.

Each job run on the system will have a default RAP associated with it. Most users already have their default RAP properly set. However, if you have more than one allocation (different RAPs), you may need/want to change your default RAP in order to charge your jobs to a particular RAP.

Changing your default RAP

  1. Go to the portal, login with your SciNet username and password.
  2. Click on "Change SciNet default RAP" and change your default RAP.

Specifying the RAP for GPC

Alternatively, you may want to assign a RAP for each particular job you run. There are two ways to specify an account for Moab/Torque: From the command line or inside the batch submission script.

Command line

Use the '-A RAPI' flag when you submit your job using qsub. Note that the command line option will override the submission script if an account is specified on both the submission script and the command line. "RAPI" is the RAP Identifier, e.g. abc-123-de.

Submission Script

Add a line in your submit script as follows:

#PBS -A RAPI

Please replace "RAPI" with your RAP Identifier.

Specifying the RAP for TCS

Add a line in your submit script as follows:

# @ account_no = RAPI

Please replace "RAPI" with your RAP Identifier.


User Stats

Show current usage stats for a given user, here $USER:

$ showstats -u $USER

Reservations

$ showres

Standard users can only see their own reservations, not those of other users or of the system. To determine what is available, a user can use "showbf", which shows what resources are available and for what duration, taking into account running jobs and all the reservations. Refer to the Available Resources section of this page for more details.

Job Dependencies

Sometimes you may want one job not to start until another job finishes, but would like to submit them both at the same time. This can be done using job dependencies on both the GPC and TCS; however, the commands differ because the underlying resource managers are different.

GPC

Use the -W flag with the following syntax in your submission script to have this job not start until the job with jobid has successfully finished

-W depend=afterok:jobid

This functionality also allows you to add dependencies on several jobs, e.g.,

-W depend=afterok:jobid1:jobid2:...:jobidN

In this case, the job will not start until all the jobs jobid1, jobid2, ..., jobidN have successfully finished. Note that the length of the string "afterok:jobid1:jobid2...:jobidN" cannot exceed 1024 characters.
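
When the list of job IDs is generated by a script, the dependency string and its length limit can be handled as in the following sketch. The function name depend_string and the job IDs are made up for illustration; the command is only printed, not submitted.

```shell
#!/bin/bash
# Sketch: build the argument for "-W depend=..." from a list of job IDs and
# enforce the 1024-character limit mentioned above.
depend_string() {
    local dep="afterok"
    local id
    for id in "$@"; do
        dep="$dep:$id"
    done
    if [ "${#dep}" -gt 1024 ]; then
        echo "dependency string too long (${#dep} > 1024 characters)" >&2
        return 1
    fi
    echo "$dep"
}

# The job IDs here are invented; the resulting command is printed, not run.
echo "qsub -W depend=$(depend_string 1234567 1234568) jobscript.sh"
```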


More detailed syntax, dependency options and examples can be found [here].

TCS

LoadLeveler handles job dependencies using what it calls steps. See the TCS Quickstart guide for an example.

Adjusting Job Priority

The ability to adjust job priorities downwards can also be of use to adjust relative priorities of jobs between users who are running jobs under the same allocation (e.g., a default or NRAC allocation of the same PI). Priorities are determined by how much of the time of that allocation has currently been used, and all users using that account will have identical priorities. This mechanism allows users to voluntarily reduce their priority to allow other users of the same allocation to run ahead of them.

In principle, by adjusting a job's priority downwards, you could reduce your job's priority to the point that someone else's job entirely could go ahead of yours. In practice, however, this is extremely unlikely. Users with NRAC allocations have priorities that are extremely large positive numbers that depend on their allocation and how much of it they have already used during the past fairshare window (2 weeks); it is very unlikely that two groups would have priorities that are within 10 or 100 or 1000 of each other.

Note that at the moment, we do not allow priorities to go negative; they are integers that can go no lower than 1. (This may change in the future.) That means that users of accounts that have already used their full allocation during the current fairshare period (e.g., over the past two weeks), and so whose priority would normally be negative but is capped at 1, cannot lower their priority any further. Similarly, users with a `default' allocation have priority 1, and cannot lower their priorities any further.

GPC

Moab allows users to adjust their jobs' priority moderately downwards, with the -p flag; that is, on a qsub line

$ qsub ... -p -10  JOBID

or in a script

...
#PBS -p -10
..

The number used (-10 in the examples above) can be any negative number down to -1024.

The ability to adjust job priorities downwards can be useful when you are running a number of jobs and want some to enter the queue at higher priorities than others. Note that if you absolutely require some jobs to start before others, you could use job dependencies instead.


For a job that is currently queued, one can adjust its priority with

$ qalter -p -10 JOBID


Suspending a Running Job

Separate from, and in addition to, the ability to place a hold on a queued job, you may want to suspend a running job. For example, you may want to test the timing of events in a weakly coupled parallel environment.

GPC

To suspend a job:

qsig -s STOP <jobid>

and to start it again:

qsig -s CONT <jobid>

Scripts are suspendable by default, so you don't need to add any signal handling for this to work. As far as we can tell, the result is identical to using fg and ctrl-Z (or kill -STOP <PID>) in an interactive run.

More about using (and trapping) signals can be found on the Using Signals page.


QDR Network Switch Affinity

The QDR network is globally 5:1 oversubscribed, but within a single switch it has full 1:1 cross-section, whereas the DDR network is completely 1:1 non-blocking. When a job is submitted to the GPC QDR nodes, the queuing system tries to fulfill the requirements with nodes on the same switch to improve network performance. If not enough nodes are available on the same switch to satisfy the job request, the queue will then use any available nodes. This behavior can be changed by using the submission flag " -l nodesetisoptional=false ", which forces the queuing system to run this job only on a single switch; the job will then stay queued until enough nodes on one switch are available to satisfy the request. Note that the maximum number of nodes on one switch is 30, so a request for more than 30 nodes with this flag will never run.

qsub -l nodes=4:ppn=8,walltime=10:00,nodesetisoptional=false

Multiple Job Submissions

If you are doing batch processing of a number of similar jobs on the GPC, torque has a feature called job arrays that can be used to simplify this process. By using the "-t 0-N" option on the command line during job submission, or putting it in the job script file as #PBS -t 0-N, torque will expand your single job submission into N+1 jobs and set the environment variable PBS_ARRAYID to each job's specific number, i.e., 0 to N. This reduces the number of calls to qsub, and can allow the user to have far fewer submission scripts. Job arrays also have the benefit of batching groups of jobs, allowing commands like qalter, qdel, and qhold to work on all or a subset of the job array jobs with one command, instead of having to run the command for each job.

In the following example, 10 jobs are submitted using a single command

qsub -t 0-9 jobscript.sh

and the submission script then modifies the job based on the PBS_ARRAYID.

#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=10:00:00
#PBS -N array_jobs

cd ${PBS_O_WORKDIR}
mkdir job.${PBS_ARRAYID}
cd job.${PBS_ARRAYID}

echo "Running job ${PBS_ARRAYID}"
mpirun -np 8 ./mycode >& array_job.${PBS_ARRAYID}.out

The JOBID and the job name both get the ARRAYID appended after a hyphen, i.e., JOBID-ARRAYID. If, for example, you wanted to cancel all the jobs in a job array you would use "qdel JOBID", whereas if you wanted to cancel just one of the jobs you would use "qdel JOBID-ARRAYID".
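
A common use of PBS_ARRAYID is to select a different input parameter for each job in the array. The sketch below illustrates the idea for an array submitted with "-t 0-9"; the parameter name and values are made up, and the default of 0 is there only so that the script can be tried outside the queue.

```shell
#!/bin/bash
# Sketch: map PBS_ARRAYID to a per-job parameter inside an array job script.
# PBS_ARRAYID is set by torque; the fallback to 0 is an assumption so that
# the sketch also runs outside the queue.
PBS_ARRAYID=${PBS_ARRAYID:-0}

# One (invented) parameter value per job in "-t 0-9".
temperatures=(100 150 200 250 300 350 400 450 500 550)
temperature=${temperatures[$PBS_ARRAYID]}

echo "array job $PBS_ARRAYID runs with temperature=$temperature"
```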

See here and here for full details.


Job Tools and Scripts

The following commands and scripts are useful when monitoring and managing submissions and jobs. We have discussed them in a recent TechTalk. More details can be found here.


In addition, the following scripts are available:

diskUsage

informs about the user and group file system usage.

quota

offers a shorter version for the user.

qsum

similar to showq, summarizes by user.

jobperf jobID

informs about the performance per-node of a given job.

jobError <jobID | jobNAME>

displays in real time the error output of a given job.

jobOutput <jobID | jobNAME>

displays in real time the standard output of a given job.

jobcd <jobID | jobNAME>

allows users to quickly move into the working directory of a given job.

jobscript <jobID | jobNAME>

displays the submission script used when submitting a given job.

jobssh <jobID | jobNAME>

allows users to connect to the head-node of a given job.

jobtop <jobID | jobNAME>

allows users to "top" on the head-node of a given job.

jobtree [user]

displays the tree of job dependencies for a given user.

jobdep <jobID>

displays the dependencies of a given job.

jobperf <jobID>

displays current performance statistics of a given job (must be running).


Monitoring Jobs

Tech Talk on Monitoring Jobs


Checking the memory usage from jobs

On many occasions it can be really useful to take a look at how much memory your job is using while it is running. There are a couple of ways to do so:

1) Use some of the command line utilities we have developed, e.g., the jobperf or jobtop utilities, which allow you to check the job performance and the head node's utilization, respectively.

2) ssh into the nodes where your job is running and check the memory usage and system stats right there, for instance with the 'top' or 'free' commands.

Also, it is always a good idea, and strongly encouraged, to inspect the standard output log and error log generated for your job submissions. These files are named JobName.{o|e}jobIdNumber, where JobName is the name you gave to the job (via the '-N' PBS flag) and jobIdNumber is the id number of the job. These files are saved in the working directory after the job is finished, but they can also be accessed in real time using the jobError and jobOutput command line utilities.

Other related topics to memory usage:
Using Ram Disk
Different Memory Configuration nodes
Monitoring Jobs in the Queue