Trillium Quickstart


This Quickstart Guide is a work in progress. Details may change as Trillium documentation is updated.


Trillium
Installed: Aug 2025
Operating System: Rocky Linux 9.6
Number of Nodes: 1284 nodes (240,768 cores)
Interconnect: Mellanox Dragonfly+
RAM/Node: 768 GB
Cores/Node: 192 (CPU nodes), 96 (GPU nodes)
Login/Devel Nodes: trillium.scinet.utoronto.ca, trillium-gpu.scinet.utoronto.ca
Queue Submission: Slurm

System Overview

The Trillium system is a state-of-the-art high performance computing platform, consisting of three main components:

1. CPU Subcluster

  • ~240,000 cores across homogeneous CPU nodes
  • Non-blocking 400 Gb/s NDR InfiniBand interconnect
  • Ideal for large-scale parallel workloads

2. GPU Subcluster

  • 61 GPU nodes, each with 4 x NVIDIA H100 (SXM) GPUs
  • 800 Gb/s bandwidth per node (200 Gb/s per GPU) over InfiniBand
  • Optimized for AI/ML and accelerated science workloads
  • Note: This subcluster is in high demand and not ideal for training extremely large models (multi-100B parameters)
  • To access, SSH into trillium-gpu.scinet.utoronto.ca from outside, or to trig-login01 from other Trillium nodes.

3. Storage System

  • Unified 29 PB VAST NVMe storage for all workloads
  • No tiering — all flash-based for consistent performance
  • Accessible via POSIX or S3 under a unified namespace

Specifications

The Trillium cluster consists of two types of nodes:

Nodes | Cores | Available memory | CPU | GPU
1224 | 192 | 768 GB DDR5 | 2 x AMD EPYC 9655 (Zen 5) @ 2.6 GHz, 384 MB L3 cache | -
60 | 96 | 768 GB DDR5 | 1 x AMD EPYC 9654 (Zen 4) @ 2.4 GHz, 384 MB L3 cache | 4 x NVIDIA H100 SXM (80 GB memory)

Each node has 768 GB of RAM. Because the system is designed for large parallel workloads, it has a fast interconnect: NDR InfiniBand in a Dragonfly+ topology with Adaptive Routing. The compute nodes are accessed through a queueing system that allows jobs with a minimum walltime of 15 minutes and a maximum of 24 hours.

Storage System

Trillium features a unified high-performance storage system based on the VAST platform, with no tiering. It serves the following directories:

  • /home – For personal files and configurations.
  • /scratch – High-speed, temporary storage for job data.
  • /project – Shared storage for project teams and collaborations.

The storage is accessible via the NDR InfiniBand fabric for maximum performance across all workloads.

Getting started on Trillium

Access to Trillium is not enabled automatically for everyone with a Digital Research Alliance of Canada (formerly Compute Canada) account, but anyone with an active Alliance account can have it enabled. If you are new to SciNet or your Supervisor/PI does not hold a current Alliance RAC allocation, you will need to request access on the Access Systems page on the CCDB site. After clicking the "I request access" button, it usually takes only one or two business days for access to be granted.

You can check if you already have Trillium access by attempting to log in. If you receive a "Permission denied" error (and your SSH key is correctly set up), you may need to opt in.

Please read this document carefully. The FAQ is also a useful resource. If at any time you require assistance, or if something is unclear, please do not hesitate to contact us.

Logging in

Trillium runs Rocky Linux 9.6, a Linux distribution, so you will need to be comfortable working on Linux systems. If you are not, it is worth your time to review our Introduction to Linux Shell class.

As with all SciNet and Alliance (formerly Compute Canada) compute systems, access to Trillium is via SSH (secure shell) only, and authentication is only possible with SSH keys. Please refer to this page to generate your SSH key pair and make sure you use it securely.
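
For example, on most systems a key pair can be generated with the following command (the linked page has full instructions on managing and registering your keys):

$ ssh-keygen -t ed25519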

Open a terminal window (e.g. Connecting with PuTTY on Windows or Connecting with MobaXTerm), then SSH into the Trillium login nodes with your Alliance (formerly Compute Canada) credentials:

$ ssh -i /path/to/ssh_private_key -Y MYALLIANCEUSERNAME@trillium.scinet.utoronto.ca
  • The Trillium login nodes are where you develop, edit, compile, prepare and submit jobs.
  • These login nodes are not part of the Trillium compute cluster, but have the same architecture, operating system, and software stack.
  • The optional -Y enables X11 forwarding, allowing graphical programs to open windows on your local computer.
  • To run on Trillium compute nodes, you must submit a batch job.

If you cannot log in, be sure to first check the System Status on this site's front page.

Note: We plan to add browser access to Trillium via Open OnDemand in the future. In the meantime you can still access our existing Open OnDemand deployment by following the instructions in our quickstart guide.

Your storage locations

On Trillium, every user has several types of storage space available. These locations each serve different purposes, and the exact paths depend on your username and group. For convenience and portability, each location is also available through a corresponding environment variable.

Home and Scratch

You have a home directory and a scratch directory. Their locations are stored in the environment variables $HOME and $SCRATCH.

On Trillium, the paths follow the naming convention:

$HOME=/home/username
$SCRATCH=/scratch/username

For example:

 tri-login01:~$ pwd
 /home/yourusername
 tri-login01:~$ cd $SCRATCH
 tri-login01:scratch$ pwd
 /scratch/yourusername

NOTE: The home directory is read-only on compute nodes.

Software Environment

Trillium uses the environment modules system to manage compilers, libraries, and other software packages. Modules dynamically modify your environment (e.g., PATH, LD_LIBRARY_PATH) so you can access different versions of software without conflicts.

A detailed explanation can be found on the modules page.

Commonly used module commands:

  • module load <module-name> – Load the default version of a software package.
  • module load <module-name>/<module-version> – Load a specific version.
  • module purge – Unload all currently loaded modules.
  • module avail – List available modules that can be loaded.
  • module list – Show currently loaded modules.
  • module spider or module spider <module-name> – Search for available modules and their versions.

Handy abbreviations are available:

  • ml – Equivalent to module list.
  • ml <module-name> – Equivalent to module load <module-name>.
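
For example, a typical interactive sequence might look like this (the module names shown, such as gcc/12.3, are only for illustration; use module avail or module spider to see what is actually installed):

 tri-login01:~$ module load StdEnv/2023      # load the standard software environment
 tri-login01:~$ module load gcc/12.3         # load a specific compiler version
 tri-login01:~$ ml                           # short for "module list"; shows what is loaded
 tri-login01:~$ module spider openmpi        # find available OpenMPI versions and their prerequisites
 tri-login01:~$ module purge                 # unload everything and start from a clean slate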

Tips for Loading Software

Properly managing your software environment is key to avoiding conflicts and ensuring reproducibility. Here are some best practices:

  • Avoid loading modules in your .bashrc file. Doing so can cause unexpected behavior, particularly in non-interactive environments like batch jobs or remote shells. For more information, see our .bashrc guidelines.
  • Instead, load modules manually or from a separate script. This approach gives you more control and helps keep environments clean.
  • Load required modules inside your job submission script. This ensures that your job runs with the expected software environment, regardless of your interactive shell settings.
  • Be explicit about module versions. Short names like gcc will load the system default (e.g., gcc/12.3), which may change in the future. Specify full versions (e.g., gcc/13.3) for long-term reproducibility.
  • Resolve dependencies with module spider. Some modules depend on others. Use module spider <module-name> to discover which modules are required and how to load them in the correct order. For more, see Using module spider.

Using Commercial Software

You may be able to use commercial software on Trillium, but there are a few important considerations:

  • Bring your own license. You can use commercial software on Trillium if you have a valid license. If the software requires a license server, you can connect to it securely using SSH tunneling (see the sketch after this list).
  • SciNet and the Alliance (formerly Compute Canada) do not provide user-specific licenses. Due to the large and diverse user base, we cannot provide licenses for individual or specialized commercial packages.
  • Freely available commercial tools. Some widely useful commercial tools are available system-wide, such as compilers, math libraries, and debuggers.
  • Software not available (unless you bring your own license): tools like MATLAB, Gaussian, and IDL are not provided centrally. If you have your own license, you are welcome to install and use them.
  • Open-source alternatives are available. Consider using freely available tools such as Python, R, and Octave, which are well-supported and widely used on the system.
  • We're here to help. If you have a valid license and need help installing commercial software, feel free to contact us; we'll assist where possible.
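
As a rough sketch of how such a tunnel could look (the license server hostname, port 27000, and the environment variable are placeholders; adapt them to your software's licensing mechanism):

 # Run this on a machine that can reach your license server (e.g., your workstation).
 # It forwards port 27000 on the Trillium login node to the license server.
 $ ssh -R 27000:license-server.example.edu:27000 MYALLIANCEUSERNAME@trillium.scinet.utoronto.ca

 # On Trillium, point the software at the forwarded port, for example with a
 # FlexLM-style variable (check your software's documentation for the exact setting):
 tri-login01:~$ export LM_LICENSE_FILE=27000@localhost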

A list of commercial software currently installed on Trillium (for which you must supply a license to use) is available on the Commercial Software page.

Technical Details

Cooling and Energy Efficiency

Trillium is fully direct liquid cooled using warm water (35–40 °C input), resulting in:

  • PUE below 1.03 (high energy efficiency)
  • Use of closed-loop dry fluid coolers, avoiding evaporative towers and new water usage
  • Heat reuse: Trillium supplies excess heat to nearby facilities to minimize climate impact

Storage System

The VAST high-performance file system is comprised of a unified 29 PB NVMe-backed storage pool, with:

  • 29 PB effective capacity (deduplicated via VAST)
  • 16.7 PB raw flash capacity
  • 714 GB/s read bandwidth, 275 GB/s write bandwidth
  • 10 million read IOPS, 2 million write IOPS
  • POSIX and S3 access protocols under a unified namespace
  • 48 C-Boxes and 14 D-Boxes for data services

Backup and Archive Storage

An additional 114 PB HPSS tape-based archive is available for nearline storage:

  • Dual-copy archive across geographically separate libraries
  • Used for both backup and archival purposes
  • Backups are managed using Atempo backup software

Testing and Debugging

Before submitting your job to the cluster, it's important to test your code to ensure correctness and determine the resources it requires.

  • Lightweight tests can be run directly on the login nodes. As a rule of thumb, these should:
    • Run in under a few minutes
    • Use no more than 1–2 GB of memory
    • Use only 1–2 CPU cores
  • You can also run the DDT debugger on the login nodes after loading it with: module load ddt-cpu
  • For short tests that exceed login node limits or require dedicated resources, request an interactive debug job using the debugjob command:
tri-login01:~$ debugjob --clean N

Replace N with the number of nodes (1 to 4). If N=1, you will get 1 hour of interactive time; with N=4 (the maximum), you will get 22 minutes. The --clean flag is optional but recommended, as it starts the session with no modules loaded, better mimicking the clean environment of batch jobs.

  • If your test job requires more time than allowed by debugjob, you can request an interactive session from the regular queue using salloc:
tri-login01:~$ salloc --nodes=N --time=M:00:00 --x11
  • N is the number of nodes
  • M is the number of hours the job should run
  • --x11 is required for graphical applications (e.g., when using DDT or DDD)

Note: Jobs submitted with salloc may take longer to start, as they are scheduled like any other batch job. See the Testing with graphics page for more information on graphical testing options.

Submitting Jobs on the CPU Subcluster

Once you have compiled and tested your code or workflow on the Trillium login nodes and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. These jobs will run on Trillium's compute nodes, and their execution is managed by the SLURM scheduler.

Trillium uses SLURM as its job scheduler. More advanced details of how to interact with the scheduler can be found on the Slurm page.

To submit a job, use the sbatch command on a login node:

tri-login01:scratch$ sbatch jobscript.sh

This places your job into the queue; it will begin execution on available compute nodes once it is scheduled. Note: jobs must be submitted from a login node; submitting from the datamover nodes is not allowed.

In most cases, you should submit jobs from your $SCRATCH directory, not $HOME, since the home directory is read-only on compute nodes. Output from your jobs must be written to $SCRATCH.

Users typically have both def and rrg accounts. Jobs will run under your group's RRG allocation, or if one is not available, under a RAS allocation (previously called the "default" allocation). Unless you explicitly specify the account using the --account=ACCOUNT_NAME option in your job script or submission command, your job will most likely be charged to the def account.

If you want your job to use the RRG allocation, be sure to specify it explicitly.
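
For example, assuming your group's RRG account is named rrg-abc-ab (a placeholder; substitute your actual account name), you can specify it on the command line:

 tri-login01:scratch$ sbatch --account=rrg-abc-ab jobscript.sh

or inside the job script itself:

 #SBATCH --account=rrg-abc-ab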

Some example job scripts are shown below.

Key Points to Remember

  • Scheduling is by node, not by core or CPU.
  • Each node has 192 cores and 768 GB of memory.
  • Jobs are limited to a maximum of 24 hours walltime.
  • Output must be written to $SCRATCH. $HOME and $PROJECT are read-only on compute nodes.
  • Compute nodes have no internet access.
  • Your job script must load all necessary modules explicitly using module load.
  • Ensure your input data is on Trillium before submitting jobs.

Scheduling by Node

On many systems that use SLURM, the scheduler deduces what resources to allocate from the requested numbers of tasks and CPUs. On Trillium, things are a bit different.

  • All job resource requests on Trillium are scheduled as a multiple of nodes.
  • The nodes that your jobs run on are exclusively yours, for as long as the job is running on them:
    • no other user can run jobs on them;
    • you can SSH into your nodes during execution to monitor progress.
  • Even if your job does not use all 192 cores, you still get the full node. Trillium does not share nodes between users.
  • Memory requests are ignored. Your job receives N × 768GB of RAM, where N is the number of nodes and 768GB is the amount of memory on each node.
  • If running serial or low-core-count jobs, you must still use all 192 cores on the node by bundling multiple independent tasks into one job script (a minimal sketch is shown after this list). See this page for examples.
  • If your job underutilizes the cores, our support team may reach out to assist you in optimizing your workflow, or you can contact us to get assistance.
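
A minimal sketch of such bundling is shown below; serial_app and its input/output file names are placeholders for your own workload (see the page linked above for more robust approaches):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=serial_bundle
#SBATCH --output=serial_bundle_%j.txt

cd $SLURM_SUBMIT_DIR

# Start 192 independent serial tasks in the background, one per core on the node.
for i in $(seq 1 192); do
    ./serial_app input_${i}.dat > output_${i}.log &
done

# Wait until all background tasks have finished before the job script exits.
wait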

Limits

There are limits to the size and duration of your jobs, the number of jobs you can run, and the number of jobs you can have queued. It matters whether a user is part of a group with a Resources for Research Group allocation or not. It also matters in which "partition" the job runs. "Partitions" are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.

Usage | Partition | Limit on running jobs | Limit on submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime
Compute jobs | compute | 50 | 1000 | 1 node (192 cores) | default: 20 nodes (3840 cores); with allocation: 1000 nodes (192000 cores) | 15 minutes | 24 hours
Testing or troubleshooting | debug | 1 | 1 | 1 node (192 cores) | 4 nodes (768 cores) | N/A | 1 hour
Archiving or retrieving data in HPSS | archivelong | 2 per user (5 in total) | 10 per user | N/A | N/A | 15 minutes | 72 hours
Inspecting archived data, small archival actions in HPSS | archiveshort, vfsshort | 2 per user | 10 per user | N/A | N/A | 15 minutes | 1 hour

Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.

Example submission script (MPI)

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=192
#SBATCH --time=01:00:00
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load StdEnv/2023
module load gcc/12.3
module load openmpi/4.1.5

mpirun ./mpi_example
# or "srun ./mpi_example"

Submit this script from your $SCRATCH directory with the command:

tri-login01:scratch$ sbatch mpi_job.sh
  • The first line indicates that this is a Bash script.
  • Lines starting with #SBATCH are directives for SLURM.
  • sbatch reads these lines as a job request (which it gives the name mpi_job).
  • In this case, SLURM looks for 2 nodes, each running 192 tasks, for 1 hour.
  • Note that the mpirun flag --ppn (processors per node) is ignored.
  • Once it finds such nodes, it runs the script:
    • Changes to the submission directory;
    • Loads the required modules;
    • Runs the mpi_example application (SLURM will inform mpirun or srun how many processes to run).
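
For completeness, the mpi_example binary could be built on a login node along these lines (assuming a C source file mpi_example.c; the module versions match those used in the script above):

 tri-login01:~$ module load StdEnv/2023 gcc/12.3 openmpi/4.1.5
 tri-login01:~$ mpicc -O2 -o mpi_example mpi_example.c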

Example submission script (OpenMP)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=192
#SBATCH --time=01:00:00
#SBATCH --job-name=openmp_job
#SBATCH --output=openmp_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load StdEnv/2023
module load gcc/12.3

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./openmp_example
# or "srun ./openmp_example"

Submit this script from your $SCRATCH directory with the command:

tri-login01:scratch$ sbatch openmp_job.sh
  • The first line indicates that this is a Bash script.
  • Lines starting with #SBATCH are directives for SLURM.
  • sbatch reads these lines as a job request (which it gives the name openmp_job).
  • In this case, SLURM looks for one node with 192 CPUs for a single task running up to 192 OpenMP threads, for 1 hour.
  • Once such a node is allocated, it runs the script:
    • Changes to the submission directory;
    • Loads the required modules;
    • Sets OMP_NUM_THREADS based on SLURM’s CPU allocation;
    • Runs the openmp_example application.
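
Similarly, the openmp_example binary could be built on a login node as follows (assuming a C source file openmp_example.c):

 tri-login01:~$ module load StdEnv/2023 gcc/12.3
 tri-login01:~$ gcc -O2 -fopenmp -o openmp_example openmp_example.c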

Monitoring Queued and Running Jobs

Once your job is submitted to the queue, you can monitor its status and performance using the following SLURM commands:

  • squeue shows all jobs in the queue. Use squeue -u $USER to view only your jobs.

  • squeue -j JOBID shows the current status of a specific job. Alternatively, use scontrol show job JOBID for detailed information, including allocated nodes, resources, and job flags.

  • squeue --start -j JOBID gives a rough estimate of when a pending job is expected to start. Note that this estimate is often inaccurate and can change depending on system load and priorities.

  • scancel JOBID cancels a job you submitted.

  • jobperf JOBID gives a live snapshot of the CPU and memory usage of your job while it is running.

  • sacct shows information about your past jobs, including start time, run time, node usage, and exit status.
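
For example (1234567 is a placeholder job ID):

 tri-login01:~$ squeue -u $USER              # list all of your queued and running jobs
 tri-login01:~$ squeue --start -j 1234567    # rough estimate of the job's start time
 tri-login01:~$ jobperf 1234567              # live CPU and memory usage of a running job
 tri-login01:~$ sacct -j 1234567             # accounting information after the job has finished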

More details on monitoring jobs can be found on the Slurm page.

You can also view and manage your current and past jobs, resource usage, and allocation history through the my.SciNet portal.

Submitting Jobs on the GPU Subcluster

The Trillium GPU subcluster is designed for AI/ML and accelerated science workloads. It has specific rules and resource limits that differ from the CPU subcluster.

Everything in the Submitting Jobs on the CPU Subcluster section applies here, with the following GPU-specific rules:

Requirement | Details
Allowed GPU counts | Jobs must request exactly 1 GPU or a multiple of 4 GPUs.
Single-GPU jobs | Use --gpus-per-node=1.
Whole-node GPU jobs | Use --gpus-per-node=4 and --partition=compute_full_node (or -p compute_full_node).
Multi-node GPU jobs | Must request full nodes: --gpus-per-node=4 and --partition=compute_full_node.
Memory limits | The --mem option is not allowed. Per GPU: 192 GB host memory. Whole-node jobs: 768 GB total.

Accessing the GPU Subcluster

  • From outside: ssh trillium-gpu.scinet.utoronto.ca
  • From another Trillium node: ssh trig-login01

Example: Single-GPU Job

#!/bin/bash
#SBATCH --job-name=single_gpu_job         # Job name
#SBATCH --output=single_gpu_job_%j.out    # Output file (%j = job ID)
#SBATCH --nodes=1                         # Request 1 node
#SBATCH --gpus-per-node=1                 # Request 1 GPU
#SBATCH --time=00:30:00                   # Max runtime (30 minutes)

# Load modules
module load StdEnv/2023
module load cuda/12.6
module load python/3.11.5

# Activate Python environment (if applicable)
source ~/myenv/bin/activate

# Check GPU allocation
srun nvidia-smi

# Run your workload
srun python my_script.py

Example: Whole-Node (4 GPUs) Job

#!/bin/bash
#SBATCH --job-name=whole_node_gpu_job
#SBATCH --output=whole_node_gpu_job_%j.out
#SBATCH --partition=compute_full_node
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00

module load StdEnv/2023
module load cuda/12.6
module load python/3.11.5

# Activate Python environment (if applicable)
source ~/myenv/bin/activate

srun python my_distributed_script.py

Example: Multi-Node GPU Job

#!/bin/bash
#SBATCH --job-name=multi_node_gpu_job
#SBATCH --output=multi_node_gpu_job_%j.out
#SBATCH --nodes=2                        # Request 2 full nodes
#SBATCH --gpus-per-node=4                # 4 GPUs per node (full node)
#SBATCH --partition=compute_full_node    # Required for full-node jobs
#SBATCH --time=04:00:00

module load StdEnv/2023
module load cuda/12.6
module load openmpi/4.1.5

# Check all GPUs allocated
srun nvidia-smi

# Activate Python environment (if applicable)
source ~/myenv/bin/activate

# Example: run a distributed training job with 8 GPUs (2 nodes × 4 GPUs)
srun python train_distributed.py

Best Practices for GPU Jobs

  • Do not use --mem — memory is fixed per GPU (192 GB) or per node (768 GB).
  • Always specify GPU counts and --partition=compute_full_node for whole-node or multi-node jobs.
  • Load only the modules you need — see Using_modules.
  • Be explicit with software versions for reproducibility (e.g., cuda/12.6 rather than just cuda).
  • Test on a single GPU before scaling to multiple GPUs or nodes.
  • Monitor usage with nvidia-smi to ensure GPUs are fully utilized.
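
As a sketch of the last point, assuming the SSH-into-your-allocated-nodes access described for the CPU subcluster also applies here (the node name is a placeholder; find the real one in the NODELIST column of squeue):

 trig-login01:~$ squeue -u $USER             # note the node name(s) assigned to your running job
 trig-login01:~$ ssh NODENAME                # SSH into one of your allocated GPU nodes
 NODENAME:~$ nvidia-smi                      # check GPU utilization and memory usage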