Running Serial Jobs on Niagara

General considerations

Use whole nodes...

When you submit a job to Niagara, it is run on one (or more than one) entire node - meaning that your job is occupying at least 40 processors for the duration of its run. The SciNet systems are usually fully utilized, with many researchers waiting in the queue for computational resources, so we require that you make full use of the nodes that your job is allocated, so other researchers don't have to wait unnecessarily, and so that your jobs get as much work done as possible.

Often, the best way to make full use of the node is to run one large parallel computation; but sometimes it is beneficial to run several serial codes at the same time. On this page, we discuss ways to run suites of serial computations at once, as efficiently as possible, using the full resources of the node.

... memory permitting

When running multiple serial jobs on the same node, it is essential to have a good idea of how much memory the jobs will require. The Niagara compute nodes have about 200GB of memory available to user jobs running on the 40 cores, i.e., a bit over 4GB per core. So the jobs also have to be bunched in ways that will fit into 200GB. If they use more than this, it will crash the node, inconveniencing you and other researchers waiting for that node.

If 40 serial jobs would not fit within the 200GB limit -- i.e. each individual job requires significantly in excess of ~4GB -- then it's allowed to just run fewer jobs so that they do fit. Note that in that case, the jobs are likely candidates for parallelization, and you can contact us at <support@scinet.utoronto.ca> and arrange a meeting with one of the technical analysts to help you with that.

If the memory requirements allow it, you could actually run more than 40 jobs at the same time, up to 80, exploiting the HyperThreading feature of the Intel CPUs. It may seem counter-intuitive, but running 80 simultaneous jobs on 40 cores for certain types of tasks has increased some users overall throughput.

Is your job really serial?

While your program may not be explicitly parallel, it may use some of Niagara's threaded libraries for numerical computations, which can make use of multiple processors. In particular, Niagara's Python and R modules are compiled with aggressive optimization and using threaded numerical libraries which by default will make use of multiple cores for computations such as large matrix operations. This can greatly speed up individual runs, but by less (usually much less) than a factor of 40. If you do have many such threaded computations to do, you often get more calculations done per unit time if you turn off the threading and run multiple such computations at once (provided that fits in memory, as explained above). You can turn off threading of these libraries with the shell script line export OMP_NUM_THREADS=1; that line will be included in the scripts below.

If your calculations implicitly use threading, you may want to experiment to see what gives you the best performance - you may find that running 4 (or even 8) jobs with 10 threads each (OMP_NUM_THREADS=10), or 2 jobs with 20 threads, gives better performance than 40 jobs with 1 thread (and almost certainly better than 1 job with 40 threads). We'd encourage to you to perform exactly such a scaling test to find the combination of number of threads per process and processes per job that maximizes your throughput; for a small up-front investment in time you may significantly speed up all the computations you need to do.

Serial jobs of similar duration

The most straightforward way to run multiple serial jobs is to bunch the serial jobs in groups of 40 or more that will take roughly the same amount of time, and create a job script that looks a bit like this

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name serialx40

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

# EXECUTION COMMAND; ampersand off 40 jobs and wait
(cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished") &
(cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished") &
(cd serialjobdir03 && ./doserialjob03 && echo "job 03 finished") &
(cd serialjobdir04 && ./doserialjob04 && echo "job 04 finished") &
(cd serialjobdir05 && ./doserialjob05 && echo "job 05 finished") &
(cd serialjobdir06 && ./doserialjob06 && echo "job 06 finished") &
(cd serialjobdir07 && ./doserialjob07 && echo "job 07 finished") &
(cd serialjobdir08 && ./doserialjob08 && echo "job 08 finished") &
(cd serialjobdir09 && ./doserialjob09 && echo "job 09 finished") &
(cd serialjobdir10 && ./doserialjob10 && echo "job 10 finished") &
(cd serialjobdir11 && ./doserialjob11 && echo "job 11 finished") &
(cd serialjobdir12 && ./doserialjob12 && echo "job 12 finished") &
(cd serialjobdir13 && ./doserialjob13 && echo "job 13 finished") &
(cd serialjobdir14 && ./doserialjob14 && echo "job 14 finished") &
(cd serialjobdir15 && ./doserialjob15 && echo "job 15 finished") &
(cd serialjobdir16 && ./doserialjob16 && echo "job 16 finished") &
(cd serialjobdir17 && ./doserialjob17 && echo "job 17 finished") &
(cd serialjobdir18 && ./doserialjob18 && echo "job 18 finished") &
(cd serialjobdir19 && ./doserialjob19 && echo "job 19 finished") &
(cd serialjobdir20 && ./doserialjob20 && echo "job 20 finished") &
(cd serialjobdir21 && ./doserialjob21 && echo "job 21 finished") &
(cd serialjobdir22 && ./doserialjob22 && echo "job 22 finished") &
(cd serialjobdir23 && ./doserialjob23 && echo "job 23 finished") &
(cd serialjobdir24 && ./doserialjob24 && echo "job 24 finished") &
(cd serialjobdir25 && ./doserialjob25 && echo "job 25 finished") &
(cd serialjobdir26 && ./doserialjob26 && echo "job 26 finished") &
(cd serialjobdir27 && ./doserialjob27 && echo "job 27 finished") &
(cd serialjobdir28 && ./doserialjob28 && echo "job 28 finished") &
(cd serialjobdir29 && ./doserialjob29 && echo "job 29 finished") &
(cd serialjobdir30 && ./doserialjob30 && echo "job 30 finished") &
(cd serialjobdir31 && ./doserialjob31 && echo "job 31 finished") &
(cd serialjobdir32 && ./doserialjob32 && echo "job 32 finished") &
(cd serialjobdir33 && ./doserialjob33 && echo "job 33 finished") &
(cd serialjobdir34 && ./doserialjob34 && echo "job 34 finished") &
(cd serialjobdir35 && ./doserialjob35 && echo "job 35 finished") &
(cd serialjobdir36 && ./doserialjob36 && echo "job 36 finished") &
(cd serialjobdir37 && ./doserialjob37 && echo "job 37 finished") &
(cd serialjobdir38 && ./doserialjob38 && echo "job 38 finished") &
(cd serialjobdir39 && ./doserialjob39 && echo "job 39 finished") &
(cd serialjobdir40 && ./doserialjob40 && echo "job 40 finished") &
wait

There are four important things to take note of here. First, the wait command at the end is crucial; without it the job will terminate immediately, killing the 40 programs you just started.

Second is that every serial job is running in its own directory; this is important because writing to the same directory from different processes can lead to slow down because of directory locking. How badly your job suffers from this depends on how much I/O your serial jobs are doing, but with 40 jobs on a node, it can quickly add up.

Third is that it is important to group the programs by how long they will take. If (say) dojob08 takes 2 hours and the rest only take 1, then for one hour 39 of the 40 cores on that Niagara node are wasted; they are sitting idle but are unavailable for other users, and the utilization of this node over the whole run is only 51%. This is the sort of thing we'll notice, and users who don't make efficient use of the machine will have their ability to use Niagara resources reduced. If you have many serial jobs of varying length, use the submission script to balance the computational load, as explained below.

Fourth, if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs.

Finally, writing out 80 cases (or even just 40, as in the above example) can become highly tedious, as can keeping track of all these subjobs. You should consider using a tool that automates this, like:

GNU Parallel

GNU parallel is a really nice tool written by Ole Tange to run multiple serial jobs in parallel. It allows you to keep the processors on each 40-core node busy, if you provide enough jobs to do.

GNU parallel is accessible on Niagara in the module gnu-parallel:

module load NiaEnv/2019b gnu-parallel

This also switches to the newer NiaEnv/2019b stack. The current version of the GNU parallel module in that stack is 20191122. In the older stack, NiaEnv/2018a (which is loaded by default), the version of GNU parallel is 20180322.

The command man parallel_tutorial shows much of GNU parallel's functionality, while man parallel gives the details of its syntax.

The citation for GNU Parallel is: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.

It is easiest to demonstrate the usage of GNU parallel by examples. First, suppose you have 80 jobs to do (similar to the above case), and that these jobs duration varies quite a bit, but that the average job duration is around 5 hours. You could use the following script (but don't, see below):

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-example

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load NiaEnv/2019b gnu-parallel

# EXECUTION COMMAND - DON'T USE THIS ONE
parallel -j $SLURM_TASKS_PER_NODE <<EOF
  cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished"
  cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished"
  ...
  cd serialjobdir80 && ./doserialjob80 && echo "job 80 finished"
EOF

The -j $SLURM_TASKS_PER_NODE parameter sets the number of jobs to run at the same time on each compute node, and is using the slurm value, which coincides with the --ntasks-per-node parameter. For gpu-parallel modules starting from version 20191122, if you omit the option -j $SLURM_TASKS_PER_NODE, you will get as many simultaneous subjobs as the ntask-per-node parameter you specify in the #SBATCH part of the jobs script.

Each line in the input given to parallel is a separate subjob, so 80 jobs are lined up to run. Initially, 40 subjobs are given to the 40 processors on the node. When one of the processors is done with its assigned subjob, it will get a next subjob instead of sitting idle until the other processors are done. While you would expect that on average this script should take 10 hours (each processor on average has to complete two jobs of 5 hours), there's a good chance that one of the processors gets two jobs that take more than 5 hours, so the job script requests 12 hours to be safe. How much more time you should ask for in practice depends on the spread in expected run times of the separate jobs.

Serial jobs of varying duration

The script above works, and can be extended to more subjobs, which is especially important if you have to do a lot (100+) of relatively short serial runs of which the walltime varies. But it gets tedious to write out all the cases. You could write a script to automate this, but you do not have to, because GNU Parallel already has ways of generating subjobs, as we will show below.

GNU Parallel can also keep track of the subjobs with succeeded, failed, or never started. For that, you just add --joblog to the parallel command followed by a filename to which to write the status:

# EXECUTION COMMAND - DON'T USE THIS ONE
parallel --joblog slurm-$SLURM_JOBID.log -j $SLURM_TASKS_PER_NODE <<EOF
  cd serialjobdir01 && ./doserialjob01
  cd serialjobdir02 && ./doserialjob02
  ...
  cd serialjobdir80 && ./doserialjob80
EOF

In this case, the job log gets written to "slurm-$SLURM_JOBID.log", where "$SLURM_JOBID" will be replaced by the job number. The joblog can also be used to retry failed jobs (more below).

Second, we can generate that set of subjobs instead of writing them out by hand. The following does the trick:

# EXECUTION COMMAND 
parallel --joblog slurm-$SLURM_JOBID.log -j $SLURM_TASKS_PER_NODE "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}

This works as follows: "cd serialjobdir{} && ./doserialjob{}" is a template command, with placeholders {}. ::: indicated that a set of parameters follows that are to be put into the template, thus generating the commands for each subjob. After the ::: we can place a space-separated set of arguments, which in this case are generated using the bash-specific construct for a range, {01..80}.

The final script now looks like this:

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-example

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load NiaEnv/2019b gnu-parallel 

# EXECUTION COMMAND 
parallel --joblog slurm-$SLURM_JOBID.log "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}

Notes:

As before, GNU Parallel keeps 40 jobs running at a time, and if one finishes, starts the next. This is an easy way to do load balancing.
The -j option was omitted, which works if using GNU Parallel module version 20191122 or higher. Otherwise, you need to add the -j $SLURM_TASKS_PER_NODE flag to the parallel command.
Doing many serial jobs often entails doing many disk reads and writes, which can be detrimental to the performance. In that case, running from the ramdisk may be an option.
- When using a ramdisk, make sure you copy your results from the ramdisk back to the scratch after the runs, or when the job is killed because time has run out.
- More details on how to setup your script to use the ramdisk can be found on the Ramdisk page.
This script optimizes resource utility, but can only use 1 node (40 cores) at a time. The next section addresses how to use more nodes.
While on the command line, the option "--bar" can be nice to see the progress, when running as a job, you would not see this status bar.
The --joblog parameter also keeps track of failed or unfinished jobs, so you can later try to redo those with the same command, but with the option "--resume" added.
If it happens that your serial jobs are running out of memory and being killed by the system, the --memfree size option can be helpful. It sets the minimum memory free when starting another job. On Niagara, size could be set to 15000M for example to match what the RealMemory slurm configuration provides to users on compute nodes. You might have to adjust it if your jobs do make use of ramdisk to hold data for example.

Version for more than 1 node at once

If you have many hundreds of serial jobs that you want to run concurrently and the nodes are available, then the approach above, while useful, would require tens of scripts to be submitted separately. Alternatively, it is possible to request more than one node and to use the following routine to distribute your processes amongst the cores.

Although it is not recommended to use GNU parallel modules before version 20191122, if you do, the script should look like this:

#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-multinode-example
 
# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
 
module load gnu-parallel

HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)
NCORES=40

parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j $NCORES -S $HOSTS --wd $PWD "cd serialjobdir{} && ./doserialjob{}" ::: {001..800}

The parameter -S $HOSTS divides the work over different nodes. $HOSTS should be a comma separated list of the node names. These node names are also stored in $SLURM_NODELIST, but with a syntax that allows for ranges, which GNU parallel does not understand. The scontrol command in the script above fixes that.
Alternatively, GNU Parallel can be passed a file with the list of nodes to which to ssh, using --sshloginfile, but your jobs script would first have to create that file.
The parameter -j $NCORES tells parallel to run 40 subjobs simultaneously on each of the nodes (note: do not use the similarly named variable $SLURM_TASKS_PER_NODE as its format is incompatible with GNU parallel).
The parameter --wd $PWD sets the working directory on the other nodes to the working directory on the first node. The --wd argument is essential: without this, the run tries to start from the wrong place and will most likely fail.
If you need an environment variable to be transfered from the job script to the remotely running subjobs, use the --env ENVIRONMENTVARIABLE argument for the parallel command. The example above copies the most common variables that a remote command may need.

Instead of this script using an old version of GNU parallel, we recommend using GNU parallel modules starting from version 20191122 that is available in NiaEnv/2019b, which facilitate automatic distribution of subjobs over nodes. For these newer versions of the module, the script can look like this:

#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-multinode-example
 
# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
 
module load NiaEnv/2019b gnu-parallel

parallel --joblog slurm-$SLURM_JOBID.log --wd $PWD "cd serialjobdir{} && ./doserialjob{}" ::: {001..800}

The mechanism of the automation of the number of tasks per nodes and the node names that GNU Parallel can use, is all through the environment variable $PARALLEL, which is set by the gnu-parallel module.
The parameter --wd $PWD sets the working directory on the other nodes to the working directory on the first node. The --wd argument is essential: without this, the run tries to start from the wrong place and will most likely fail.
If you need an environment variable to be transfered from the job script to the remotely running subjobs, use the --env ENVIRONMENTVARIABLE argument for the parallel command. The $PARALLEL environment variable is already set to copy the most common variables $PATH, $LD_LIBRARY_PATH, and $OMP_NUM_THREADS.

Of course, this is just an example of what you could do with gnu parallel. How you set up your specific run depends on how each of the runs would be started. One could for instance also prepare a file of commands to run and make that the input to parallel as well.

Submitting several bunches to single nodes, as in the section above, is a more fail-safe way of proceeding, since a node failure would only affect one of these bunches, rather than all runs.

We reiterate that if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs. The way the above example job script are written, you simple change #SBATCH --ntasks-per-node=40 to #SBATCH --ntasks-per-node=80 to accomplish this.

More on GNU parallel

The documentation for GNU parallel can be found at http://www.gnu.org/software/parallel/ .
After loading the gnu-parallel module, type man parallel_tutorial
After loading the gnu-parallel module, type man parallel
The man page can also be found at http://www.gnu.org/software/parallel/man.html .
Watch a recording of a Compute Ontario Colloquium</a> on GNU parallel.

Bundling multiple MPI sub-jobs on trillium

There may be cases in which you code won't scale well to all 192 cores on a trillium node. In that case you may still fully utilize the node, by running multiple sub-jobs in parallel, even if they are not serial:

#!/bin/bash
# SLURM submission script for multiple MPI jobs on a full niagara node
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=192
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-MPI-example

# EXECUTION COMMAND; ampersand off 3 sub-jobs, 64-tasks-each and wait
(mpirun -N 64 ./test_solver 3 && echo "job 1 finished") &
(mpirun -N 64 ./test_solver 4 && echo "job 1 finished") &
(mpirun -N 64 ./test_solver 5 && echo "job 1 finished") &
wait

GNU Parallel Reference

The author of GNU parallel request that when using GNU parallel for a publication, you please cite:

O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.