Slurm

The queueing system used at SciNet is based around the Slurm Workload Manager. This "scheduler", Slurm, determines which jobs will be run on which compute nodes, and when. This page outlines how to submit jobs, how to interact with the scheduler, and some of the most common Slurm commands.

Answers to some common questions about the queueing system can also be found on the FAQ.

Submitting jobs

You submit jobs from a Niagara login node. This is done by passing a script to the sbatch command:

nia-login07:~$ sbatch jobscript.sh

This puts the job, described by the job script, into the queue. The scheduler will run the job on the compute nodes in due course. A typical submission script is as follows.

#!/bin/bash 
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"

Some notes about this example:

  • The first line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name mpi_job).
  • In this case, SLURM looks for 2 nodes with 40 cores on which to run 80 tasks, for 1 hour.
  • Note that the mpirun flag "--ppn" (processors per node) is ignored. Slurm takes care of this detail.
  • Once the scheduler finds a spot to run the job, it runs the script:
    • It changes to the submission directory;
    • Loads modules;
    • Runs the mpi_example application.
  • To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
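
For example, a hyperthreaded variant of the script above would differ only in the following lines (a sketch based on the example above; the --bind-to none option applies to OpenMPI only):

   #SBATCH --ntasks-per-node=80

   # 80 tasks per node use the 80 logical CPUs provided by hyperthreading
   mpirun --bind-to none ./mpi_example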

To create a job script appropriate for your work, modify the example above so that it runs the commands you need.

Things to remember

There are some things to always bear in mind when crafting your submission script:

  • Scheduling is by node, so in multiples of 40 cores. You are expected to use all 40 cores! If you are running serial jobs, and need assistance bundling your work into multiples of 40, please see the serial jobs page.
  • Jobs must write to your scratch or project directory (home is read-only on compute nodes).
  • Compute nodes have no internet access. Download data you need before submitting your job.
  • Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).
  • Jobs will run under your group's RRG allocation. If your group does not have an allocation, your job will run under your group's RAS allocation (previously called 'default' allocation). Note that groups with an allocation cannot run under a default allocation.
  • The maximum walltime for all users is 24 hours. The minimum and default walltime is 15 minutes.

Scheduling details

We now present the details of how to write a job script, and some extra commands which you might find useful.

SLURM nomenclature: jobs, nodes, tasks, cpus, cores, threads

SLURM has a somewhat different way of referring to things like MPI processes and thread tasks, as compared to our previous scheduler, MOAB. The SLURM nomenclature is reflected in the names of scheduler options (i.e., resource requests). SLURM strictly enforces those requests, so it is important to get this right.

term                 | meaning                                                                                    | SLURM term | related scheduler options
job                  | scheduled piece of work for which specific resources were requested                       | job        | sbatch, salloc
node                 | basic computing component with several cores (40 for Niagara) that share memory           | node       | --nodes -N
mpi process          | one of a group of running programs using Message Passing Interface for parallel computing | task       | --ntasks -n --ntasks-per-node
core or physical cpu | a fully functional independent physical execution unit                                    | -          | -
logical cpu          | an execution unit that the operating system can assign work to; operating systems can be configured to overload physical cores with multiple logical cpus using hyperthreading | cpu | --cpus-per-task
thread               | one of possibly multiple simultaneous execution paths within a program, which can share memory | -      | --cpus-per-task and OMP_NUM_THREADS
hyperthread          | a thread run in a collection of threads that is larger than the number of physical cores  | -          | -

Scheduling by Node

  • On many systems that use SLURM, the scheduler will deduce from the job script specifications (the number of tasks and the number of cpus-per-task) what resources should be allocated. On Niagara, this is a bit different.
  • All job resource requests on Niagara are scheduled as a multiple of nodes.
  • The nodes that your jobs run on are exclusively yours.
    • No other users are running anything on them.
    • You can ssh into them, while your job is running, to see how things are going.
  • Whatever you request of the scheduler, your request will always be translated into a multiple of nodes allocated to your job.
  • Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes. Each node has about 202GB of RAM available.
  • You should try to use all the cores on the nodes allocated to your job. Since there are 40 cores per node, your job should use N x 40 cores. If this is not the case, we will contact you to help you optimize your workflow. Again, users who have serial jobs should consult the serial jobs page.

Hyperthreading: Logical CPUs vs. cores

Hyperthreading, a technology that leverages more of the physical hardware by pretending there are twice as many logical cores as real cores, is enabled on Niagara. The operating system and scheduler see 80 logical CPUs.

Using 80 logical CPUs versus 40 real cores typically gives about a 5-10% speedup, depending on your application (your mileage may vary).

Because Niagara is scheduled by node, hyperthreading is actually fairly easy to use:

  • Ask for a certain number of nodes, N, for your job.
  • You know that you get 40 x N cores, so you will use (at least) a total of 40 x N MPI processes or threads (mpirun, srun, and the OS will automatically spread these over the real cores).
  • But you should also test if running 80 x N MPI processes or threads gives you any speedup.
  • Regardless, your usage will be counted as 40 x N x (walltime in years).

Many applications which are communication-heavy can benefit from the use of hyperthreading.

Submission script details

This section outlines some details of how to interact with the scheduler, and how it implements Niagara's scheduling policies.

Queues

There are 3 queues available on SciNet systems. These queues have different limits; see the Limits section for further details.

Compute

The compute queue is the default queue. Most jobs will run in this queue. If no flags are specified in the submission script, this is the queue where your job will land.

Debug

The Debug queue is a high-priority queue, used for short-term testing of your code. Do NOT use the debug queue for production work. You can use the debug queue in one of two ways. To submit a standard job script to the debug queue, add the line

#SBATCH -p debug

to your submission script. This will put the job into the debug queue, and it should run in short order.

To request an interactive debug session, where you retain control over the command line prompt, at a login node type the command

 nia-login07:~$ salloc -p debug --nodes 1 --time=1:00:00

This will request 1 node for 1 hour. You can similarly request a debug session using the 'debugjob' command:

 nia-login07:~$ debugjob N

where N is the number of nodes. If N=1, this gives an interactive session of one hour; if N=4 (the maximum), it gives you 30 minutes.

Archive

The archivelong and archiveshort queues are only used by the HPSS system. See that page for details on how to use these queues.

Limits

There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a Resources for Research Groups allocation or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.

Usage                                                    | Partition    | Running jobs             | Jobs in queue | Min. size of jobs | Max. size of jobs                                                        | Min. walltime | Max. walltime
Compute jobs                                             | compute      | 50                       | 1000          | 1 node (40 cores) | default: 20 nodes (800 cores); with allocation: 1000 nodes (40000 cores) | 15 minutes    | 24 hours
Testing or troubleshooting                               | debug        | 1                        | 1             | 1 node (40 cores) | 4 nodes (160 cores)                                                      | N/A           | min(1, 1.5/nnode) hours
Archiving or retrieving data in HPSS                     | archivelong  | 2 per user (max 5 total) | 10 per user   | N/A               | N/A                                                                      | 15 minutes    | 72 hours
Inspecting archived data, small archival actions in HPSS | archiveshort | 2 per user               | 10 per user   | N/A               | N/A                                                                      | 15 minutes    | 1 hour

Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.

Slurm Accounts

To be able to prioritise jobs based on groups and allocations, the Slurm scheduler uses the concept of accounts. Each group that has a Resources for Research Groups (RRG) or Research Platforms and Portals (RPP) allocation (awarded through an annual competition by Compute Canada) has an account that starts with rrg- or rpp-. Slurm assigns a 'fairshare' priority to these accounts based on the size of the award in core-years. Groups without an RRG or RPP can use Niagara through the so-called Rapid Access Service (RAS), and have an account that starts with def-.

On Niagara, most users will only ever use one account, and those users do not need to specify the account to Slurm. However, users that are part of collaborations may be able to use multiple accounts, i.e., that of their sponsor and that of their collaborator, but this means that they need to select the right account when running jobs.

To select the account, just add

   #SBATCH -A [account]

to the job scripts, or use the -A [account] option with salloc or debugjob.

To see which accounts you have access to, or what their names are, use the command

   sshare -U

It has been noted that, in some cases, using the '-A' flag does not result in the appropriate account being used. To get around this, specify the account when sbatch is invoked:

   sbatch -A account myjobscript.sh

Slurm environment variables

There are many environment variables built into Slurm. These are some which you may find useful:

  • SLURM_SUBMIT_DIR: directory from which the job was submitted.
  • SLURM_SUBMIT_HOST: host from which the job was submitted.
  • SLURM_JOB_ID: the job's id.
  • SLURM_JOB_NUM_NODES: number of nodes in the job.
  • SLURM_JOB_NODELIST: list of nodes assigned to the job.
  • SLURM_JOB_ACCOUNT: account associated with the job.

Any of these environment variables can be accessed from within your job script.
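
For instance, a job script could report where and how it is running by echoing some of these variables; a minimal sketch:

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=40
   #SBATCH --time=0:15:00

   echo "Job $SLURM_JOB_ID (account $SLURM_JOB_ACCOUNT) was submitted from $SLURM_SUBMIT_HOST:$SLURM_SUBMIT_DIR"
   echo "It is running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"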

Passing Variables to submission scripts

It is possible to pass values through environment variables into your SLURM submission scripts. To pass all variables already defined in your shell, just add the following directive to the submission script,

#SBATCH --export=ALL

and you will have access to any predefined environment variable.

A better way is to specify explicitly which variables you want to pass into the submission script,

sbatch --export=i=15,j='test' jobscript.sbatch

You can even set the job name and output files using environment variables, e.g.

i="simulation"
j=14
sbatch --job-name=$i.$j.run --output=$i.$j.out --export=i=$i,j=$j jobscript.sbatch

(The latter only works on the command line; you cannot use environment variables in #SBATCH lines in the job script.)
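
Inside jobscript.sbatch, the exported variables are then available like any other shell variables. A minimal sketch (the ./simulate executable is only a placeholder for illustration):

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --time=1:00:00

   echo "Running case $i with label $j"
   ./simulate --case "$i" --label "$j"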

Command line arguments

Command line arguments can also be used for job scripts in the same way as command line arguments for shell scripts. All command line arguments given to sbatch that follow the job script name will be passed to the job script. SLURM will not look at any of these arguments, so you must place all sbatch options before the script name, e.g.:

sbatch  -p debug  jobscript.sbatch  FirstArgument SecondArgument ...

In this example, -p debug is interpreted by SLURM, while in your submission script you can access FirstArgument, SecondArgument, etc., by referring to $1, $2, ....
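
Inside the job script these arguments are available as in any shell script; a minimal sketch (my_code is only a placeholder for your own executable):

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --time=0:30:00

   echo "First argument:  $1"
   echo "Second argument: $2"
   ./my_code "$1" "$2"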

Job arrays

Sometimes you need to run the same job script many times, but just tweaking one value each time. One way of accomplishing this is using job arrays. Job arrays are invoked using the "-a" flag with sbatch:

sbatch -a 1-100 myjobscript.sh

This will submit 100 instances of myjobscript.sh. Within the job script you can distinguish which of those instances is running using the environment variable SLURM_ARRAY_TASK_ID.
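
A minimal sketch of such a job script, assuming (for illustration only) that each instance processes an input file named after its array index:

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=40
   #SBATCH --time=1:00:00
   #SBATCH --output=array_%A_%a.txt

   # SLURM_ARRAY_TASK_ID runs from 1 to 100 for "sbatch -a 1-100"
   ./my_code input_${SLURM_ARRAY_TASK_ID}.dat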

Note that Niagara currently has a limit of 1000 submitted jobs for users within groups with allocations, and 200 submitted jobs without an allocation.

Job dependencies

You can make one job dependent on the successful completion of another job using the following command:

 sbatch --dependency=afterok:JOBID myjobscript.sh

This will make the current job submission not start until the parent job, with jobid JOBID, successfully completes. There are many job dependency options available. Visit the Slurm sbatch page for the full list.

If the parent job fails (that is, ends with a non-zero exit code) the dependent job can never be scheduled and will be automatically cancelled.
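
If you do not know the parent's job ID in advance, you can capture it when you submit the parent, for example with sbatch's --parsable option, which prints only the job ID (parent_job.sh is a placeholder name):

   JOBID=$(sbatch --parsable parent_job.sh)
   sbatch --dependency=afterok:$JOBID myjobscript.sh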

Email Notification

Email notification works, but you need to add the email address and the type of notification you want to receive to your submission script, e.g.

   #SBATCH --mail-user=YOUR.email.ADDRESS
   #SBATCH --mail-type=ALL

The sbatch man page (type man sbatch on Niagara) explains all possible mail-types.

Job Location Constraints

Node types

With the expansion of Niagara there are now two node types: 1548 nodes with Intel 6148 "skylake" CPUs, and 468 nodes with Intel 6248 "cascadelake" CPUs. By default a job will be placed on the first available nodes but will not span node types. You can specify a node type by adding one of the following directives to your submission script:

   #SBATCH --constraint=skylake 
   #SBATCH --constraint=cascade

EDR/HDR Infiniband Topology

The Infiniband high speed network used for job communication and file I/O on Niagara consists of 5 1:1 subscribed "wings" that are connected together in a dragonfly topology with adaptive routing enabled. Four wings (dragonfly[1-4]) consist of EDR-based skylake nodes, and dragonfly5 contains all of the HDR100 cascadelake nodes. By default, multi-node jobs will run on the first available nodes, which could be all within one wing or span multiple wings, but not across node types. For most scalable parallel programs the performance difference should not be very significant; however, if you wish to keep your jobs from spanning wings, you can use the following:

   #SBATCH --constraint=[dragonfly1|dragonfly2|dragonfly3|dragonfly4|dragonfly5]

(note that the brackets are part of the syntax here.)

Monitoring jobs

There are many options available for monitoring your jobs. The most basic of which is the squeue command:

nia-login07:~$ squeue -u USERNAME
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   292047   compute   myjob4 username PD       0:00      4 (Priority)
   292048   compute   myjob3 username PD       0:00      4 (Priority)
   266829   compute   myjob2 username  R   18:56:17      2 nia[1397-1398]
   266828   compute   myjob1 username  R   18:56:46      1 nia1298

Here you can see that we have two running jobs ('R') and two pending jobs ('PD'). The nodes being used are listed.

Job status

To get an estimate of when a job will start, use the command

 squeue --start -j JOBID

Note that this is only an estimate, and tends not to be very accurate.

Information about a specific job can be found using the command

 squeue -j JOBID

or alternatively

 scontrol show job JOBID

which is more verbose.

SSHing to a node

Once your job has started, the node belongs to you. As such you may, from a login node, SSH into the node to check the performance of your job. The first step is to find out which nodes are being used (see above). Once you have your list of nodes, you can SSH into them directly. Once there, you can run the 'top' or 'free' commands to check both CPU and memory usage.
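
For example, to look up the nodes of a running job and inspect one of them (the node name below is taken from the squeue example above):

   nia-login07:~$ squeue -j JOBID -o "%N"
   nia-login07:~$ ssh nia1298
   nia1298:~$ top        # per-process CPU and memory usage
   nia1298:~$ free -g    # overall memory usage, in GB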

jobperf

The jobperf script will give you feedback on the performance of your currently-running job:

nia-login07:~$ jobperf 123456
----------------------------------------------------------------------------------------------------
                   RUNNING          IDLE      USER       MEMORY(MB)          PROCESS NAMES
   HOSTNAME     #  %CPU  %MEM    DISK SLEEP   NAME    RAMDISK  USED AVAIL    (excl:bash,sh,ssh,sshd)
----------------------------------------------------------------------------------------------------
nia1013         71   174%  0.5%      0   22    ejspence      0  15060  178017   14*gmx_mpi mpiexec slurm_script
nia1014         79   192%  0.1%      0   18    ejspence      0  14803  178274   13*gmx_mpi
nia1295         79   188%  0.4%      0   18    ejspence      0  15199  177878   13*gmx_mpi
----------------------------------------------------------------------------------------------------

Here you can see both the CPU and memory usage of the job, for all nodes being used.

Other commands

Some other commands that can be useful for dealing with your jobs:

  • scancel -i JOBID cancels a specific job.
  • sacct gives information about your recent jobs.
  • sinfo -p compute gives a list of available nodes.
  • qsum gives a summary of the queue by user.
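
For example, sacct's output can be tailored with its --starttime and --format options; a sketch (adjust the date and fields to your needs):

   sacct --starttime=2023-01-01 --format=JobID,JobName,Partition,State,Elapsed,NNodes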

Example submission scripts

Here we present some examples of how to create submission scripts for running parallel jobs. Serial job examples can be found on the serial jobs page.

Example submission script (MPI)

#!/bin/bash 
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load intel/2018.2
module load openmpi/3.1.0

mpirun ./mpi_example
# or "srun ./mpi_example"

Submit this script with the command:

   nia-login07:~$ sbatch mpi_job.sh
  • First line indicates that this is a bash script.

  • Lines starting with #SBATCH go to SLURM.

  • sbatch reads these lines as a job request (which it gives the name mpi_job).

  • In this case, SLURM looks for 8 nodes with 40 cores on which to run 320 tasks, for 1 hour.

  • Note that the mpirun flag "--ppn" (processors per node) is ignored.

  • Once the scheduler finds such nodes, it runs the script:

    • Changes to the submission directory;
    • Loads modules;
    • Runs the mpi_example application.
  • To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).

Example submission script (OpenMP)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH --time=1:00:00
#SBATCH --job-name openmp_job
#SBATCH --output=openmp_output_%j.txt
#SBATCH --mail-type=FAIL

module load intel/2018.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example".

Submit this script with the command:

   nia-login07:~$ sbatch openmp_job.sh
  • First line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name openmp_job).
  • In this case, SLURM looks for one node on which to run a single task using all 40 cores, for 1 hour.
  • Once it finds such a node, it runs the script:
    • Loads modules;
    • Sets the OMP_NUM_THREADS environment variable;
    • Runs the openmp_example application.
  • To use hyperthreading, just change --cpus-per-task=40 to --cpus-per-task=80.