Niagara Neptune Nodes

From SciNet Users Documentation
Installed: March 2023
Operating System: CentOS 7.9
Number of Nodes: 40 (3,200 cores)
Interconnect: Mellanox Dragonfly+
RAM/Node: 484 GiB / 520 GB
Cores/Node: 80 (160 hyperthreads)
Login/Devel Node: niagara.scinet.utoronto.ca
Vendor Compilers: icc (C), ifort (Fortran), icpc (C++)
Queue Submission: Slurm

Specifications

The Niagara Neptune Nodes are a special partition of the Niagara cluster for dedicated projects administered by SciNet.

Each node of the cluster has 484 GiB / 520 GB of RAM, of which roughly 475 GiB per node (about 6 GiB per core) is available to jobs. The nodes have a fast interconnect consisting of an HDR InfiniBand network that is part of Niagara's overall Dragonfly+ topology with Adaptive Routing. An interesting technical aspect is that these nodes are entirely water-cooled. The Neptune nodes are all compute nodes; they can only be accessed through a queueing system that allows jobs with a minimum walltime of 15 minutes and a maximum of 24 hours, and that favours large jobs. Jobs should be submitted from the Niagara login nodes.

Login, Storage, and Software

Access to these resources is not open to general users of Niagara or of other CC resources. For those who do have access, the integration with Niagara and its file system means that we can refer to the Niagara Quickstart for the topics below (a brief login example follows the list):

  • Logging in
  • The directory and file system structure
  • Moving data to Niagara
  • Loading software modules
  • Software stacks: NiaEnv and CCEnv
  • Available compilers and interpreters
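For example, logging in to the Niagara login nodes and loading a software module typically looks like the following (the username is a placeholder; the Quickstart has the authoritative instructions):

ssh -Y myusername@niagara.scinet.utoronto.ca
nia-login07:~$ module load NiaEnv/2022a
nia-login07:~$ module avail    # list the modules available in the loaded software stack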

Some groups of users of these nodes may be offered a different route to login and submit jobs instead of the Niagara login nodes.

The main differences with the regular Niagara nodes lie in how to test and debug and how to submit jobs to this partition, which is explained below.

Testing and Debugging

You should test your code before submitting it to the cluster, both to confirm that it is correct and to determine what resources it needs.

  • Small test jobs can be run on the Niagara login nodes. Rule of thumb: tests should run no more than a couple of minutes, taking at most about 1-2GB of memory, and use no more than a couple of cores.
  • You can run the DDT debugger on the login nodes after module load ddt.
  • For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob_neptune command:
nia-login07:~$ debugjob_neptune --clean N

where N is the number of nodes. If N=1, this gives an interactive session of 1 hour; if N=2 (the maximum), it gives you 45 minutes. The --clean argument is optional but recommended, as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.

Finally, if your testing session needs more than 1 hour, you can request an interactive job from the regular queue using the salloc command. Note, however, that this may take some time to start, since the request is part of the regular queue and will run when the scheduler decides.

nia-login07:~$ salloc --nodes N --time=M:00:00 --x11 -p compute_neptune --qos neptune

where N is again the number of nodes and M is the number of hours you wish the job to run. The --x11 flag is required if you need to use graphics while testing your code through salloc, e.g. when using a debugger such as DDT or DDD. See the Testing with graphics page for the options in that case.
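For instance, a two-hour interactive session on a single Neptune node (the node count and duration here are purely illustrative) could be requested with:

nia-login07:~$ salloc --nodes 1 --time=2:00:00 -p compute_neptune --qos neptune

Once the allocation is granted you get a shell in which to run your tests, and the session ends when the requested time expires.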

Submitting jobs

Once you have compiled and tested your code or workflow on a login node, and confirmed that it behaves correctly, you are ready to submit jobs to run on one or more of the 40 Neptune nodes of the Niagara cluster. When and where your job runs is determined by the scheduler.

Niagara uses SLURM as its job scheduler. More-advanced details of how to interact with the scheduler can be found on the Slurm page.

You submit jobs from a login node by passing a script to the sbatch command:

nia-login07:scratch$ sbatch jobscript.sh

This puts the job in the queue. It will run on the compute nodes in due course. Note that you must submit your job from a login node. You cannot submit jobs from the datamover nodes.

In most cases, you should not submit from your $HOME directory, but rather from your $SCRATCH directory, so that the output of your compute job can be written out ($HOME is read-only on the compute nodes).
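A typical submission therefore looks like this (jobscript.sh is a placeholder for your own script):

nia-login07:~$ cd $SCRATCH
nia-login07:scratch$ sbatch jobscript.sh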

Some example job scripts can be found below.

Keep in mind:

  • Scheduling is by node, so in multiples of 80 cores.
  • Your job's maximum walltime is 24 hours.
  • Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).
  • Compute nodes have no internet access.
  • Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).
  • Move your data to the Niagara file systems before you submit your job.

Scheduling by Node

On many systems that use SLURM, the scheduler will deduce from the specifications of the number of tasks and the number of cpus-per-node what resources should be allocated. On Niagara things are a bit different.

  • All job resource requests on Niagara are scheduled as a multiple of nodes.
  • The nodes that your jobs run on are exclusively yours, for as long as the job is running on them.
    • No other users are running anything on them.
    • You can SSH into them to see how things are going.
  • Whatever your requests to the scheduler, it will always be translated into a multiple of nodes allocated to your job.
  • Memory requests to the scheduler are of no use. Your job always gets N x 520GB of RAM, where N is the number of nodes and 520GB is the amount of memory on the node.
  • If you run serial jobs you should still aim to use all 80 cores on the node. Visit the serial jobs page for examples of how to do this on Niagara; a minimal sketch is also given right after this list.
  • Since there are 80 cores per node, your job should use N x 80 cores. If this is not possible for your workload, contact us for assistance.
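As a sketch of the serial-job approach (the executable name, the input and output file names, and the use of plain background processes are assumptions; the serial jobs page describes more robust options such as GNU Parallel), a script along these lines keeps all 80 cores of a node busy:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=serial_bundle

# Launch 80 independent serial tasks, one per core; serial_example and
# the per-task input/output files are hypothetical placeholders.
for i in $(seq 1 80); do
    ./serial_example input_$i.dat > output_$i.txt &
done
wait   # keep the job alive until all background tasks have finished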

Limits

There are safeguard limits to the size and duration of jobs, the number of jobs you can run and the number of jobs you can have queued.

  • Compute jobs (QOS: neptune, partition: compute_neptune)
    • Limit on running jobs: TBD
    • Limit on submitted jobs (incl. running): TBD
    • Min. size of jobs: 1 node (80 cores)
    • Max. size of jobs: 20 nodes (1,600 cores)
    • Min. walltime: 15 minutes
    • Max. walltime: 24 hours
  • Testing or troubleshooting (QOS: neptune, partition: debug_neptune)
    • Limit on running jobs: 1
    • Limit on submitted jobs (incl. running): 1
    • Min. size of jobs: 1 node (80 cores)
    • Max. size of jobs: 2 nodes (160 cores)
    • Min. walltime: N/A
    • Max. walltime: 1 hour

Even if you respect these limits, your jobs may still have to wait in the queue if the nodes are busy.

Example submission script (MPI)

#!/bin/bash 
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

module load NiaEnv/2022a
module load intel/2022u2
module load openmpi/4.1.4+ucx-1.11.2

mpirun ./mpi_example
# or "srun ./mpi_example"

Submit this script from your scratch directory with the command:

   nia-login07:scratch$ sbatch mpi_job.sh
  • First line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name mpi_job)
  • In this case, SLURM looks for 2 nodes each running 80 tasks (for a total of 160 tasks), for 1 hour.
  • These nodes must be in the compute_neptune partition, and in the neptune qos (quality-of-service). Note that both the -p and the --qos options must be specified.
  • Once it has found such nodes, it runs the script:
    • Changes to the submission directory;
    • Loads modules;
    • Runs the mpi_example application (SLURM will inform mpirun or srun on how many processes to run).
  • To use hyperthreading, just change --ntasks-per-node=80 to --ntasks-per-node=160, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI); the relevant changes are sketched below.
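A sketch of a hyperthreaded variant of the same example; only the lines shown here differ from the script above:

#SBATCH --ntasks-per-node=160

mpirun --bind-to none ./mpi_example
# --bind-to none is only needed with OpenMPI; omit it when using IntelMPI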

Example submission script (OpenMP)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=openmp_job
#SBATCH --output=openmp_output_%j.txt
#SBATCH --mail-type=FAIL

module load NiaEnv/2022a
module load intel/2022u2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example".

Submit this script from your scratch directory with the command:

   nia-login07:scratch$ sbatch openmp_job.sh
  • First line indicates that this is a bash script.
  • Lines starting with #SBATCH go to SLURM.
  • sbatch reads these lines as a job request (which it gives the name openmp_job).
  • In this case, SLURM looks for one node with 80 cores to be run inside one task, for 1 hour.
  • These nodes must be in the compute_neptune partition, and in the neptune qos (quality-of-service). Note that both the -p and the --qos options must be specified.
  • Once it has found such a node, it runs the script:
    • Changes to the submission directory;
    • Loads modules;
    • Sets an environment variable;
    • Runs the openmp_example application.
  • To use hyperthreading, just change --cpus-per-task=80 to --cpus-per-task=160, as sketched below.
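A sketch of the hyperthreaded variant of this example; only the #SBATCH line shown here changes, and OMP_NUM_THREADS then picks up the value 160 automatically through $SLURM_CPUS_PER_TASK:

#SBATCH --cpus-per-task=160

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # now set to 160
./openmp_example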

Monitoring queued jobs

Once the job is incorporated into the queue, there are some commands you can use to monitor its progress; a short example session is given after the list.

  • squeue to show the job queue (squeue -u $USER for just your jobs);
  • squeue -j JOBID to get information on a specific job (alternatively, scontrol show job JOBID, which is more verbose);
  • squeue --start -j JOBID to get an estimate for when a job will run; these tend not to be very accurate predictions;
  • scancel -i JOBID to cancel the job;
  • jobperf JOBID to get an instantaneous view of the cpu and memory usage of the nodes of the job while it is running;
  • sacct to get information on your recent jobs.
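For example, a typical monitoring sequence after submitting a job might look like this (the job ID 123456 is just a placeholder):

nia-login07:scratch$ sbatch mpi_job.sh
Submitted batch job 123456
nia-login07:scratch$ squeue -u $USER            # is the job still pending or already running?
nia-login07:scratch$ squeue --start -j 123456   # rough estimate of the start time
nia-login07:scratch$ jobperf 123456             # cpu and memory usage while it runs
nia-login07:scratch$ sacct -j 123456            # accounting information once it has finished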

Further instructions for monitoring your jobs can be found on the Slurm page. The my.SciNet site is also a very useful tool for monitoring your current and past usage.

Support