Running Serial Jobs on Niagara

General considerations

Use whole nodes...

When you submit a job to Niagara, it runs on one or more entire nodes, meaning that your job occupies at least 40 processors for the duration of its run. The SciNet systems are usually fully utilized, with many researchers waiting in the queue for computational resources. We therefore require that you make full use of the nodes allocated to your job, so that other researchers do not have to wait unnecessarily and your jobs get as much work done as possible.

Often, the best way to make full use of the node is to run one large parallel computation; but sometimes it is beneficial to run several serial codes at the same time. On this page, we discuss ways to run suites of serial computations at once, as efficiently as possible, using the full resources of the node.

... memory permitting

When running multiple serial jobs on the same node, it is essential to have a good idea of how much memory the jobs will require. The Niagara compute nodes have about 200GB of memory available to user jobs running on the 40 cores, i.e., a bit over 4GB per core. So the jobs also have to be bunched in ways that will fit into 200GB. If they use more than this, it will crash the node, inconveniencing you and other researchers waiting for that node.

If 40 serial jobs would not fit within the 200GB limit (i.e., each individual job requires significantly more than ~4GB), then you may run fewer jobs per node so that they do fit. Note that in that case, the jobs are likely candidates for parallelization, and you can contact us at <support@scinet.utoronto.ca> and arrange a meeting with one of the technical analysts to help you with that.

If the memory requirements allow it, you could actually run more than 40 jobs at the same time, up to 80, exploiting the HyperThreading feature of the Intel CPUs. It may seem counter-intuitive, but for certain types of tasks, running 80 simultaneous jobs on 40 cores has increased some users' overall throughput.
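As a rough sketch of that arithmetic, the hypothetical helper below estimates how many copies of a job fit on one node. The function name and the rounding to whole gigabytes are assumptions made for illustration; the two limits are the figures quoted above (about 200GB of usable memory, and at most 80 simultaneous jobs with HyperThreading).

```bash
#!/bin/bash
# Sketch only: estimate how many serial jobs fit on one Niagara node.
# NODE_MEM_GB and MAX_JOBS come from the text above; max_concurrent_jobs
# is a hypothetical helper, not a SciNet-provided tool.
NODE_MEM_GB=200   # memory available to user jobs on a node
MAX_JOBS=80       # upper limit when using HyperThreading on 40 cores

max_concurrent_jobs() {
    local mem_per_job_gb=$1    # peak memory of ONE job, in whole GB
    local fit=$(( NODE_MEM_GB / mem_per_job_gb ))
    if (( fit > MAX_JOBS )); then
        fit=$MAX_JOBS
    fi
    echo "$fit"
}

max_concurrent_jobs 8   # 8GB per job: 25 jobs fit
max_concurrent_jobs 2   # 2GB per job: memory would allow 100, capped at 80
```

The estimate is only as good as the per-job memory figure you feed it, so it is worth measuring the peak memory of one representative run first.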

Is your job really serial?

While your program may not be explicitly parallel, it may use some of Niagara's threaded libraries for numerical computations, which can make use of multiple processors. In particular, Niagara's Python and R modules are compiled with aggressive optimization and use threaded numerical libraries, which by default will make use of multiple cores for computations such as large matrix operations. This can greatly speed up individual runs, but by less (usually much less) than a factor of 40. If you do have many such threaded computations to do, you often get more calculations done per unit time if you turn off the threading and run multiple such computations at once (provided that fits in memory, as explained above). You can turn off threading in these libraries with the shell script line export OMP_NUM_THREADS=1; that line is included in the scripts below.

If your calculations implicitly use threading, you may want to experiment to see what gives you the best performance - you may find that running 4 (or even 8) jobs with 10 threads each (OMP_NUM_THREADS=10), or 2 jobs with 20 threads, gives better performance than 40 jobs with 1 thread (and almost certainly better than 1 job with 40 threads). We'd encourage you to perform exactly such a scaling test to find the combination of number of threads per process and processes per job that maximizes your throughput; for a small up-front investment in time you may significantly speed up all the computations you need to do.
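One possible shape for such a scaling test is sketched below. The helper names list_layouts and time_layout are made up for this example, and ./mycalc in the note that follows is a placeholder for your own program. The sketch assumes each process performs one complete calculation.

```bash
#!/bin/bash
# Sketch only: enumerate and time thread/process layouts that fill a node.
# list_layouts and time_layout are hypothetical helpers for illustration.

# Print every "threads processes" pair whose product equals the core count.
list_layouts() {
    local cores=$1 threads
    for (( threads = 1; threads <= cores; threads++ )); do
        if (( cores % threads == 0 )); then
            echo "$threads $(( cores / threads ))"
        fi
    done
}

# Start <processes> copies of a command, each limited to <threads> threads,
# wait for all of them, and report the elapsed wall time in whole seconds.
time_layout() {
    local threads=$1 processes=$2
    shift 2
    local start=$SECONDS i
    for (( i = 0; i < processes; i++ )); do
        OMP_NUM_THREADS=$threads "$@" &
    done
    wait
    echo "threads=$threads processes=$processes seconds=$(( SECONDS - start ))"
}

list_layouts 40   # 8 layouts, from "1 40" up to "40 1"
```

For example, `time_layout 10 4 ./mycalc` would start 4 copies with 10 threads each. Dividing the number of processes by the reported seconds gives a throughput figure that can be compared across layouts. Because SECONDS only has one-second resolution, the test run should last at least a few minutes.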

Serial jobs of similar duration

The most straightforward way to run multiple serial jobs is to bunch them in groups of 40 or more that will take roughly the same amount of time, and to create a job script that looks a bit like this:

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=1:00:00
#SBATCH --job-name serialx40

# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

# EXECUTION COMMAND; ampersand off 40 jobs and wait
(cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished") &
(cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished") &
(cd serialjobdir03 && ./doserialjob03 && echo "job 03 finished") &
(cd serialjobdir04 && ./doserialjob04 && echo "job 04 finished") &
(cd serialjobdir05 && ./doserialjob05 && echo "job 05 finished") &
(cd serialjobdir06 && ./doserialjob06 && echo "job 06 finished") &
(cd serialjobdir07 && ./doserialjob07 && echo "job 07 finished") &
(cd serialjobdir08 && ./doserialjob08 && echo "job 08 finished") &
(cd serialjobdir09 && ./doserialjob09 && echo "job 09 finished") &
(cd serialjobdir10 && ./doserialjob10 && echo "job 10 finished") &
(cd serialjobdir11 && ./doserialjob11 && echo "job 11 finished") &
(cd serialjobdir12 && ./doserialjob12 && echo "job 12 finished") &
(cd serialjobdir13 && ./doserialjob13 && echo "job 13 finished") &
(cd serialjobdir14 && ./doserialjob14 && echo "job 14 finished") &
(cd serialjobdir15 && ./doserialjob15 && echo "job 15 finished") &
(cd serialjobdir16 && ./doserialjob16 && echo "job 16 finished") &
(cd serialjobdir17 && ./doserialjob17 && echo "job 17 finished") &
(cd serialjobdir18 && ./doserialjob18 && echo "job 18 finished") &
(cd serialjobdir19 && ./doserialjob19 && echo "job 19 finished") &
(cd serialjobdir20 && ./doserialjob20 && echo "job 20 finished") &
(cd serialjobdir21 && ./doserialjob21 && echo "job 21 finished") &
(cd serialjobdir22 && ./doserialjob22 && echo "job 22 finished") &
(cd serialjobdir23 && ./doserialjob23 && echo "job 23 finished") &
(cd serialjobdir24 && ./doserialjob24 && echo "job 24 finished") &
(cd serialjobdir25 && ./doserialjob25 && echo "job 25 finished") &
(cd serialjobdir26 && ./doserialjob26 && echo "job 26 finished") &
(cd serialjobdir27 && ./doserialjob27 && echo "job 27 finished") &
(cd serialjobdir28 && ./doserialjob28 && echo "job 28 finished") &
(cd serialjobdir29 && ./doserialjob29 && echo "job 29 finished") &
(cd serialjobdir30 && ./doserialjob30 && echo "job 30 finished") &
(cd serialjobdir31 && ./doserialjob31 && echo "job 31 finished") &
(cd serialjobdir32 && ./doserialjob32 && echo "job 32 finished") &
(cd serialjobdir33 && ./doserialjob33 && echo "job 33 finished") &
(cd serialjobdir34 && ./doserialjob34 && echo "job 34 finished") &
(cd serialjobdir35 && ./doserialjob35 && echo "job 35 finished") &
(cd serialjobdir36 && ./doserialjob36 && echo "job 36 finished") &
(cd serialjobdir37 && ./doserialjob37 && echo "job 37 finished") &
(cd serialjobdir38 && ./doserialjob38 && echo "job 38 finished") &
(cd serialjobdir39 && ./doserialjob39 && echo "job 39 finished") &
(cd serialjobdir40 && ./doserialjob40 && echo "job 40 finished") &
wait

There are four important things to take note of here. First, the wait command at the end is crucial; without it the job will terminate immediately, killing the 40 programs you just started.

Second, every serial job runs in its own directory; this is important because writing to the same directory from different processes can lead to slowdowns because of directory locking. How badly your job suffers from this depends on how much I/O your serial jobs are doing, but with 40 jobs on a node, it can quickly add up.

Third, it is important to group the programs by how long they will take. If (say) doserialjob08 takes 2 hours and the rest only take 1, then for one hour 39 of the 40 cores on that Niagara node are wasted; they are sitting idle but are unavailable for other users, and the utilization of this node over the whole run is only 51%. This is the sort of thing we'll notice, and users who don't make efficient use of the machine will have their ability to use Niagara resources reduced. If you have many serial jobs of varying length, use the submission script to balance the computational load, as explained below.
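The 51% figure can be checked with a few lines of bash. The helper below is hypothetical and assumes one subjob per core, with the node held until the longest subjob finishes; durations are in whole hours.

```bash
#!/bin/bash
# Sketch only: percentage of core-hours actually used when every subjob
# occupies one core and the node is held until the longest one finishes.
# node_utilization is a hypothetical helper for illustration.
node_utilization() {
    local total=0 longest=0 d
    for d in "$@"; do
        total=$(( total + d ))
        if (( d > longest )); then
            longest=$d
        fi
    done
    # used core-hours / (number of subjobs * wall time), rounded down
    echo $(( 100 * total / ($# * longest) ))
}

# 39 one-hour subjobs plus a single two-hour subjob, as in the example above
durations=()
for i in $(seq 1 39); do
    durations+=(1)
done
durations+=(2)
node_utilization "${durations[@]}"   # prints 51
```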

Fourth, if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs.

Finally, writing out 80 cases (or even just 40, as in the above example) can become highly tedious, as can keeping track of all these subjobs. You should consider using a tool that automates this, like:

GNU Parallel

GNU parallel is a really nice tool written by Ole Tange to run multiple serial jobs in parallel. It allows you to keep the processors on each 40-core node busy, if you provide enough jobs to do.

GNU parallel is accessible on Niagara in the module gnu-parallel/20180322:

module load gnu-parallel/20180322

The citation for GNU Parallel is: O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

It is easiest to demonstrate the usage of GNU parallel by examples. First, suppose you have 80 jobs to do (similar to the above case), and that the duration of these jobs varies quite a bit, but that the average job duration is around 5 hours. You could use the following script (but don't, see below):

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-example

# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load gnu-parallel/20180322  

# EXECUTION COMMAND - DON'T USE THIS ONE
parallel -j $SLURM_TASKS_PER_NODE <<EOF
  cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished"
  cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished"
  ...
  cd serialjobdir80 && ./doserialjob80 && echo "job 80 finished"
EOF

The -j $SLURM_TASKS_PER_NODE parameter sets the number of jobs to run at the same time, and uses the value set by Slurm, which coincides with the --ntasks-per-node parameter. Each line in the input given to parallel is a separate subjob, so 80 jobs are lined up. Initially, 40 subjobs are given to the 40 processors on the node. When one of the processors is done with its assigned subjob, it gets the next subjob instead of sitting idle until the other processors are done. While you would expect that on average this script should take 10 hours (each processor on average has to complete two jobs of 5 hours), there's a good chance that one of the processors gets two jobs that take more than 5 hours, so the job script requests 12 hours to be safe. How much more time you should ask for in practice depends on the spread in run times of the separate jobs.
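To get a feel for how the spread in run times affects the wall time, the scheduling described above can be imitated in plain bash. The helper estimate_walltime is hypothetical. It assumes subjobs are started in the order given, each on whichever slot frees up first, and that durations are whole hours known in advance (which in practice they rarely are).

```bash
#!/bin/bash
# Sketch only: estimate the wall time of "parallel -j SLOTS" for a list of
# subjob durations, assuming each new subjob starts on the slot that
# becomes free first. estimate_walltime is a hypothetical helper.
estimate_walltime() {
    local slots=$1
    shift
    local -a busy_until=()
    local i d earliest finish=0
    for (( i = 0; i < slots; i++ )); do
        busy_until[i]=0
    done
    for d in "$@"; do
        # find the slot that becomes free first
        earliest=0
        for (( i = 1; i < slots; i++ )); do
            if (( busy_until[i] < busy_until[earliest] )); then
                earliest=$i
            fi
        done
        busy_until[earliest]=$(( busy_until[earliest] + d ))
        if (( busy_until[earliest] > finish )); then
            finish=${busy_until[earliest]}
        fi
    done
    echo "$finish"
}

# Four subjobs averaging 5 hours on two slots: the average suggests 10 hours,
# but this particular order of long and short subjobs takes 11 hours.
estimate_walltime 2 3 7 6 4   # prints 11
```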

Serial jobs of varying duration

The script above works, and can be extended to more subjobs, which is especially important if you have to do a lot (100+) of relatively short serial runs of which the walltime varies. But it gets tedious to write out all the cases. You could write a script to automate this, but you do not have to, because GNU Parallel already has ways of generating subjobs, as we will show below.

GNU Parallel can also keep track of which subjobs succeeded, failed, or never started. For that, you just add --joblog to the parallel command, followed by the name of a file to which to write the status:

# EXECUTION COMMAND - DON'T USE THIS ONE
parallel --joblog subjobs.log -j $SLURM_TASKS_PER_NODE <<EOF
  cd serialjobdir01 && ./doserialjob01
  cd serialjobdir02 && ./doserialjob02
  ...
  cd serialjobdir80 && ./doserialjob80
EOF

The joblog can also be used to retry failed jobs (more below).

We can also generate that set of subjobs instead of writing them out by hand. The following does the trick:

# EXECUTION COMMAND 
parallel --joblog subjobs.log -j $SLURM_TASKS_PER_NODE "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}

This works as follows: "cd serialjobdir{} && ./doserialjob{}" is a template command, with placeholders {}. The ::: indicates that a set of parameters follows that are to be put into the template, thus generating the commands for each subjob. After the ::: we can place a space-separated set of arguments, which in this case are generated using the bash-specific construct for a range, {01..80}.
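The substitution can be imitated in plain bash to see which commands a given template and argument list would produce. The function expand_template below is a hypothetical helper written for this illustration; it only prints the commands and is not part of GNU Parallel.

```bash
#!/bin/bash
# Sketch only: mimic in plain bash how the {} placeholder is filled in.
# expand_template is a hypothetical helper; it prints commands, it does
# not run them.
expand_template() {
    local template=$1
    shift
    local placeholder='{}' arg
    for arg in "$@"; do
        # replace every occurrence of {} with the current argument
        echo "${template//"$placeholder"/$arg}"
    done
}

# bash expands {01..03} to the three zero-padded arguments 01 02 03
expand_template "cd serialjobdir{} && ./doserialjob{}" {01..03}
# prints:
#   cd serialjobdir01 && ./doserialjob01
#   cd serialjobdir02 && ./doserialjob02
#   cd serialjobdir03 && ./doserialjob03
```

GNU Parallel itself offers a similar check: adding the --dry-run option to the parallel command should print the generated commands without running them.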

The final script now looks like this:

#!/bin/bash
# SLURM submission script for multiple serial jobs on Niagara
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-example

# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load gnu-parallel/20180322  

# EXECUTION COMMAND 
parallel --joblog subjobs.log -j $SLURM_TASKS_PER_NODE "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}

Notes:

As before, GNU Parallel keeps 40 jobs running at a time, and if one finishes, starts the next. This is an easy way to do load balancing.
Doing many serial jobs often entails doing many disk reads and writes, which can be detrimental to the performance. In that case, running from the ramdisk may be an option.
When using a ramdisk, make sure you copy your results from the ramdisk back to the scratch after the runs, or when the job is killed because time has run out.
More details on how to set up your script to use the ramdisk can be found on the Ramdisk wiki page.
This script optimizes resource utilization, but can only use 1 node (40 cores) at a time. The next section addresses how to use more nodes.
On the command line, the option "--bar" is a nice way to see the progress, but when running as a job, you would not see this status bar.
The --joblog parameter also keeps track of failed or unfinished jobs, so you can later redo those with the same command: add the option "--resume" to run the subjobs that never finished, or "--resume-failed" to rerun the failed ones as well.
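Before resuming, it can be useful to see which subjobs failed. The sketch below assumes the usual tab-separated joblog layout, with a header line followed by the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal and Command. The helper failed_subjobs and the demonstration log are made up for this example.

```bash
#!/bin/bash
# Sketch only: list the subjobs that a GNU Parallel joblog records as failed.
# Assumes a tab-separated joblog with a header line and the columns
# Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command.
# failed_subjobs is a hypothetical helper for illustration.
failed_subjobs() {
    local joblog=$1
    # skip the header; print commands whose exit value or signal is non-zero
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$joblog"
}

# Small made-up joblog, for demonstration only
demo_log=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$demo_log"
printf '1\t:\t1520000000.000\t3600.000\t0\t0\t0\t0\tcd serialjobdir01 && ./doserialjob01\n' >> "$demo_log"
printf '2\t:\t1520000000.000\t12.000\t0\t0\t1\t0\tcd serialjobdir02 && ./doserialjob02\n' >> "$demo_log"

failed_subjobs "$demo_log"   # prints: cd serialjobdir02 && ./doserialjob02
rm -f "$demo_log"
```

In a real run, the argument would be the file given to --joblog, such as subjobs.log in the scripts above.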

Version for more than 1 node at once

If you have many hundreds of serial jobs that you want to run concurrently and the nodes are available, then the approach above, while useful, would require tens of scripts to be submitted separately. Alternatively, it is possible to request more than one node and to use the following routine to distribute your processes amongst the cores.

#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --time=12:00:00
#SBATCH --job-name gnu-parallel-multinode-example
 
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR
 
# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1
 
module load gnu-parallel/20180322 

HOSTLIST=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)

parallel --joblog subjobs.log -j $SLURM_NTASKS_PER_NODE -S $HOSTLIST --workdir $PWD "cd serialjobdir{} && ./doserialjob{}" ::: {001..800}

The parameter -S $HOSTLIST divides the work over different nodes. $HOSTLIST is a comma-separated list of the node names. These node names are also stored in $SLURM_NODELIST, but with a syntax that allows for ranges, which GNU parallel does not understand. The scontrol command fixes that.
Alternatively, GNU Parallel can be passed a file with the list of nodes to which to ssh, using --sshloginfile.
The parameter -j $SLURM_NTASKS_PER_NODE tells parallel to run 40 subjobs simultaneously on each of the nodes. Note that this is not the same variable as the $SLURM_TASKS_PER_NODE used in the single-node scripts: for a multi-node job, the latter has a form like 40(x4), which parallel cannot use as a job count.
The --workdir $PWD sets the working directory on the other nodes to the working directory on the first node. The --workdir argument is essential: without this, the run tries to start from the wrong place and will most likely fail.
If you need an environment variable to be transferred from the job script to the remotely running subjobs, use the --env ENVIRONMENTVARIABLE argument for the parallel command (for example, --env OMP_NUM_THREADS).

Of course, this is just an example of what you could do with GNU Parallel. How you set up your specific run depends on how each of the runs would be started. One could for instance also prepare a file of commands to run and make that the input to parallel as well.
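A minimal sketch of preparing such a file of commands is given below. The helper write_command_list and the file name commands.txt are hypothetical names chosen for this example, and the three-digit directory numbering follows the multi-node script above.

```bash
#!/bin/bash
# Sketch only: write one subjob command per line to a file that can later
# be given to GNU Parallel on standard input. write_command_list and
# commands.txt are hypothetical names chosen for this example.
write_command_list() {
    local outfile=$1 first=$2 last=$3 i
    for (( i = first; i <= last; i++ )); do
        # zero-pad the index to three digits: 001, 002, ...
        printf 'cd serialjobdir%03d && ./doserialjob%03d\n' "$i" "$i"
    done > "$outfile"
}

write_command_list commands.txt 1 800
wc -l < commands.txt   # 800 lines, one per subjob
```

The job script would then read the commands from that file instead of from a template, for instance by replacing the template and ::: arguments with `< commands.txt`, in the same way that the earlier examples fed commands to parallel through a here-document.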

Submitting several bunches to single nodes, as in the section above, is a more fail-safe way of proceeding, since a node failure would only affect one of these bunches, rather than all runs.

We reiterate that if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs. The way the above example job scripts are written, you simply change #SBATCH --ntasks-per-node=40 to #SBATCH --ntasks-per-node=80 to accomplish this.

More on GNU parallel

The documentation for GNU parallel can be found at http://www.gnu.org/software/parallel/
Its man page can be found here http://www.gnu.org/software/parallel/man.html
The man page is also available on Niagara when the gnu-parallel module is loaded, with the command $ man parallel. The man page contains examples and describes further options, such as how to keep the output of different subjobs from getting scrambled.

GNU Parallel Reference

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.