Trillium Neptune Nodes
| Trillium Neptune Nodes | |
|---|---|
| Installed | March 2023 |
| Operating System | Linux (Rocky 9.2) |
| Number of Nodes | 40 (3,200 cores) |
| Interconnect | HDR InfiniBand |
| RAM/Node | 484 GiB / 520 GB |
| Cores/Node | 80 (160 hyperthreads) |
| Login/Devel Node | trillium.scinet.utoronto.ca |
| Vendor Compilers | icc (C), icpc (C++), ifort (Fortran) |
| Queue Submission | Slurm |
Specifications
The Trillium Neptune Nodes are a special partition of the Trillium cluster for dedicated projects administered by SciNet.
Each node of the cluster has 484 GiB / 520 GB of RAM (about 6 GiB per core for jobs, or roughly 475 GiB per node). The nodes are connected by a fast HDR InfiniBand network that is part of Trillium's overall network topology. An interesting technical aspect is that these were the first nodes at SciNet to be entirely liquid-cooled. The Neptune nodes are all compute nodes that can only be accessed through a queueing system, which allows jobs with a minimum walltime of 15 minutes and a maximum of 24 hours, and which favours large jobs. Jobs should be submitted from the Trillium (CPU) login nodes.
Login, Storage, and Software
Access to these resources is not open to general users of Trillium or of other CC resources. For those who do have access, the integration with Trillium and its file system means that we can refer to the Trillium Quickstart for:
- Logging in
- The directory and file system structure
- Moving data to Trillium
- Loading software modules
- Available compilers and interpreters
Some groups of users of these nodes may be offered a different route to login and submit jobs instead of the Trillium login nodes.
The main differences with the regular Trillium nodes lie in the architecture of the CPUs (Intel vs. AMD), the number of cores per node (80 vs. 192 on Trillium), and hyperthreading (which is enabled on the Neptune nodes but not on the Trillium nodes). Furthermore, there are differences in how to test, debug, and submit jobs to this partition, which are explained below.
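If you have access and want to confirm these hardware characteristics yourself on a node (for example, inside an interactive debug job, described below), a quick sketch using standard Linux tools is:
lscpu | grep -E 'Model name|^CPU\(s\)|Thread\(s\) per core'   # CPU model, logical CPU count, threads per core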
Testing and Debugging
You should test your code before you submit it to the cluster, both to check that it is correct and to determine what resources it needs.
- Small test jobs can be run on the Trillium login nodes. Rule of thumb: tests should run no longer than a couple of minutes, take at most about 1-2 GB of memory, and use no more than a couple of cores.
- You can run the DDT debugger on the login nodes after module load ddt-cpu.
- For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob_neptune command:
tri-login06:~$ debugjob_neptune N
where N is the number of nodes. If N=1, this gives an interactive session of 1 hour; if N=2 (the maximum), it gives you 45 minutes.
- Finally, if your debugging session needs more than 1 hour, you can request an interactive job from the regular queue using the salloc command. Note, however, that this may take some time to start, since it will be part of the regular queue and will run when the scheduler decides.
tri-login06:~$ salloc --nodes N --time=M:00:00 --x11 -p compute_neptune --qos neptune
where N is again the number of nodes, and M is the number of hours you wish the job to run. The --x11 flag is required if you need to use graphics while testing your code through salloc, e.g. when using a debugger such as DDT.
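For instance, a hedged example that requests one Neptune node for two hours and then starts the DDT debugger inside the allocation (the executable name mpi_example is a placeholder) might look like:
tri-login06:~$ salloc --nodes 1 --time=2:00:00 --x11 -p compute_neptune --qos neptune
# then, once the allocation starts, inside the interactive shell:
module load ddt-cpu
ddt ./mpi_example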
Submitting jobs
Once you have compiled and tested your code or workflow on a login node, and confirmed that it behaves correctly, you are ready to submit jobs to run on one or more of the 40 Neptune nodes of the Trillium cluster. When and where your job runs is determined by the scheduler.
Trillium uses SLURM as its job scheduler. More advanced details of how to interact with the scheduler can be found on the Slurm page.
You submit jobs from a login node by passing a script to the sbatch command:
tri-login06:scratch$ sbatch jobscript.sh
This puts the job in the queue. It will run on the compute nodes in due course. Note that you must submit your job from a login node. You cannot submit jobs from the datamover nodes.
In most cases, you should not submit from your $HOME directory, but rather from your $SCRATCH directory, so that the output of your compute job can be written out ($HOME is read-only on the compute nodes; see below).
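For example (the subdirectory name myproject is hypothetical):
tri-login06:~$ cd $SCRATCH/myproject
tri-login06:myproject$ sbatch jobscript.sh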
Some example job scripts can be found below.
Keep in mind:
- Scheduling is by node, so in multiples of 80 cores.
- Your job's maximum walltime is 24 hours.
- Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).
- Compute nodes have no internet access.
- Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).
Scheduling by Node
On many systems that use SLURM, the scheduler will deduce from the specification of the number of tasks and the number of cpus-per-task what resources should be allocated. On Trillium, things are a bit different.
- All job resource requests on Trillium are scheduled as a multiple of nodes.
- The nodes that your jobs run on are exclusively yours, for as long as the job is running on them.
- No other users are running anything on them.
- You can SSH into them to see how things are going.
- Whatever your requests to the scheduler, it will always be translated into a multiple of nodes allocated to your job.
- Memory requests to the scheduler are of no use. Your job always gets N x 520GB of RAM, where N is the number of nodes and 520GB is the amount of memory on the node.
- If you run serial jobs you should still aim to use all 80 cores on the node. Visit the serial jobs page for examples of how to do this on Trillium; a minimal sketch also follows this list.
- Since there are 80 cores per node, your job should use N x 80 cores. If your workflow cannot easily do this, you can contact us for assistance.
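As a minimal sketch of the bundling idea (not a substitute for the serial jobs page; the executable serial_example and its input/output file names are hypothetical), a job that runs 80 independent serial tasks on one Neptune node could look like this:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=serial_bundle
#SBATCH --output=serial_output_%j.txt

# Launch 80 independent serial tasks in the background, one per core,
# then wait until all of them have finished before the job ends.
for i in $(seq 1 80); do
  ./serial_example input.$i > output.$i 2>&1 &
done
wait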
Limits
There are safeguard limits to the size and duration of jobs, the number of jobs you can run and the number of jobs you can have queued.
| Usage | QOS | Partition | Limit on Running jobs | Limit on Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime |
|---|---|---|---|---|---|---|---|---|
| Compute jobs | neptune | compute_neptune | TBD | TBD | 1 node (80 cores) | 20 nodes (1600 cores) | 15 minutes | 24 hours |
| Testing or troubleshooting | neptune | debug_neptune | 1 | 1 | 1 node (80 cores) | 2 nodes (160 cores) | N/A | 1 hour |
Even if you respect these limits, your jobs may still have to wait in the queue if the nodes are busy.
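Since some of the limits above are still listed as TBD, you can also query the scheduler directly for the current settings; a hedged example using standard Slurm commands (with the partition and QOS names from the table) is:
tri-login06:~$ scontrol show partition compute_neptune   # current MaxTime, MaxNodes, etc. for the partition
tri-login06:~$ sacctmgr show qos neptune                 # limits attached to the neptune QOS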
Example submission script (MPI)
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

module load intel/2025.2.0
module load openmpi/5.0.8

source /scinet/vast/etc/vastpreload-openmpi.bash # important if doing MPI-IO

mpirun ./mpi_example # do not use "srun"
Submit this script from your scratch directory with the command:
tri-login06:scratch$ sbatch mpi_job.sh
- First line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name mpi_job).
- In this case, SLURM looks for 2 nodes, each running 80 tasks (for a total of 160 tasks), for 1 hour.
- These nodes must be in the compute_neptune partition and in the neptune QOS (quality of service). Note that both the -p and the --qos options must be specified.
- Once it has found such nodes, it runs the script:
  - Changes to the submission directory;
  - Loads modules;
  - Preloads the MPI-IO library;
  - Runs the mpi_example application (SLURM will inform mpirun how many processes to run).
- To use hyperthreading, just change --ntasks-per-node=80 to --ntasks-per-node=160, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).
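For example, the hyperthreaded (OpenMPI) version of the relevant lines would read:
#SBATCH --ntasks-per-node=160

mpirun --bind-to none ./mpi_example # do not use "srun"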
Example submission script (OpenMP)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=80
#SBATCH --time=1:00:00
#SBATCH -p compute_neptune
#SBATCH --qos neptune
#SBATCH --job-name=openmp_job
#SBATCH --output=openmp_output_%j.txt
#SBATCH --mail-type=FAIL

module load intel/2025

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./openmp_example
Submit this script from your scratch directory with the command:
tri-login06:scratch$ sbatch openmp_job.sh
- First line indicates that this is a bash script.
- Lines starting with #SBATCH go to SLURM.
- sbatch reads these lines as a job request (which it gives the name openmp_job).
- In this case, SLURM looks for one node with 80 cores to be used by a single task, for 1 hour.
- Once it has found such a node, it runs the script:
  - Changes to the submission directory;
  - Loads the compiler module;
  - Sets an environment variable (OMP_NUM_THREADS);
  - Runs the openmp_example application.
- To use hyperthreading, just change --cpus-per-task=80 to --cpus-per-task=160.
Monitoring queued jobs
Once the job is incorporated into the queue, there are some commands you can use to monitor its progress.
- squeue to show the job queue (squeue -u $USER for just your jobs);
- squeue -j JOBID to get information on a specific job (alternatively, scontrol show job JOBID, which is more verbose);
- squeue --start -j JOBID to get an estimate for when a job will run; these tend not to be very accurate predictions;
- scancel -i JOBID to cancel the job;
- jobperf JOBID to get an instantaneous view of the cpu and memory usage of the nodes of the job while it is running;
- sacct to get information on your recent jobs.
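For example (1234567 is a hypothetical job ID):
tri-login06:~$ squeue -u $USER            # list your queued and running jobs
tri-login06:~$ scontrol show job 1234567  # detailed information on one specific job
tri-login06:~$ scancel -i 1234567         # cancel that job, with a confirmation prompt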
Further instructions for monitoring your jobs can be found on the Slurm page. The my.SciNet site is also a very useful tool for monitoring your current and past usage.