Teach
Teach Cluster | |
---|---|
Installed | Feb 2025 (originally Mar 2020) |
Operating System | Linux (Rocky 9.5) |
Number of Nodes | 8 |
Interconnect | InfiniBand (EDR) |
RAM/Node | 188 GiB (202 GB) |
Cores/Node | 40 (80 hyperthreads) |
Login/Devel Nodes | teach-login01, teach-login02 |
Vendor Compilers | gcc |
Queue Submission | Slurm |
Teaching Cluster
SciNet has assembled some older compute hardware into a small cluster provided primarily for teaching purposes. It is configured similarly to the upcoming SciNet production system Trillium; however, it uses hardware repurposed from its predecessor, Niagara. This system should not be used for production work; as such, the queuing policies are designed to provide fast job turnover and to limit the amount of resources one person can use at a time. Questions about its use or problems should be sent to support@scinet.utoronto.ca.
Specifications
The cluster currently consists of 8 repurposed x86_64 nodes, each with 40 cores (from two 20-core Intel Cascade Lake CPUs) running at 2.5 GHz and 188 GiB of RAM per node. The nodes support hyperthreading, so every physical core presents itself as two logical cores. The nodes are interconnected with 1:1 non-blocking EDR InfiniBand, which is used for MPI communications and for disk I/O to a separate view of the VAST file system. In total the cluster contains 320 cores.
Login/Devel Node
Teach runs Rocky Linux 9. You will need to be somewhat familiar with Linux systems to work on Teach. If you are not, it will be worth your time to review our Introduction to Linux Shell class.
As with all SciNet and Alliance (formerly Compute Canada) systems, access to Teach is done via SSH (secure shell) only. Open a terminal window (e.g. using PuTTY or MobaXTerm on Windows), and type
ssh -Y USERNAME@teach.scinet.utoronto.ca
This will bring you directly to the command line of teach-login01 or teach-login02, the gateway/devel nodes for this cluster. These login nodes are shared between students of a number of different courses. Use them to develop and compile code, to run short tests, and to submit computations to the scheduler (see below).
Note that access to the Teach cluster is restricted to temporary accounts whose names consist of the prefix 'lcl_uot', followed by the course number, the letter 's', and a number. Passwords for these accounts can be changed on the SciNet user portal.
Software Modules
Other than essentials, all installed software is made available using module commands. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be found on the modules page.
The Teach cluster makes the same modules available as on the General Purpose clusters of the Digital Research Alliance of Canada, with one caveat. On Teach, by default, only the "gentoo" module is loaded, which provides basic OS-level functionality.
Common module subcommands are:
- module load <module-name> : load the default version of a particular software.
- module load <module-name>/<module-version> : load a specific version of a particular software.
- module purge : unload all currently loaded modules.
- module spider (or module spider <module-name>) : list available software packages.
- module avail : list loadable software packages.
- module list : list loaded modules.
For example, to make the GNU compilers (gcc, g++ and gfortran) available, you should type
module load gcc
while the Intel compilers (icc, icpc and ifort) can be loaded by
module load intel
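As a quick check that a compiler module works, here is a minimal sketch with the gcc module loaded (hello.c is a hypothetical C source file):
gcc -O2 -o hello hello.c
./hello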
To get the default modules that are loaded on the General Purpose clusters, you must load the "StdEnv" module.
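For example:
module load StdEnv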
Along with modifying common environment variables, such as the PATH, these modules also create an EBROOT<MODULENAME> environment variable, which points to the installation directory of the package and can be used to access commonly needed subdirectories, such as include and lib.
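For instance, with the gcc module loaded, the following prints the corresponding installation directory (the exact path differs per system):
echo $EBROOTGCC
A library module (say, a hypothetical gsl module) could then be referenced in compile commands with flags such as -I$EBROOTGSL/include and -L$EBROOTGSL/lib.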
There are handy abbreviations for the module commands: ml is the same as module list, and ml <module-name> is the same as module load <module-name>.
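For example, the following loads the gcc module and then lists the currently loaded modules:
ml gcc
ml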
A list of available software modules can be found on the available modules page.
Interactive jobs
For an interactive session on a compute node of the Teach cluster, with resources allocated specifically to you rather than shared as on the login nodes, use the 'debugjob' command:
teach01:~$ debugjob -n C
where C is the number of cores. An interactive session lasts at most four hours when using at most one node (C<=40), and 60 minutes when using four nodes (i.e., 120<C<=160), which is the maximum number of nodes debugjob allows for an interactive session.
For a short interactive session on one or more dedicated compute nodes of the Teach cluster, use the 'debugjob' command as follows:
teach01:~$ debugjob N
where N is the number of nodes. On the Teach cluster, this is equivalent to debugjob -n 40*N. The positive integer N can be at most 4.
If no arguments are given to debugjob, it allocates a single core on a Teach compute node.
There are limits on the resources you can get with a debugjob, and how long you can get them. No debugjob can run longer than four hours or use more than 160 cores, and each user can only run one at a time. For longer computations, jobs must be submitted to the scheduler.
Submit a Job
Teach uses SLURM as its job scheduler. More advanced details of how to interact with the scheduler can be found on the Slurm page.
You submit jobs from a login node by passing a script to the sbatch command:
teach-login01:~scratch$ sbatch jobscript.sh
This puts the job in the queue. It will run on the compute nodes in due course.
Note:
- Each Teach cluster node has two CPUs with 20 cores each, for a total of 40 cores per node. The nodes have hyperthreading enabled, so it may seem like you are getting twice as many cores, but those are logical cores, not physical ones.
- Make sure to adjust the flags --ntasks-per-node or --ntasks, together with --nodes, accordingly for the examples found on the Slurm page (a sample job script adapted for Teach is shown after these notes).
- The current Slurm configuration of the Teach cluster allocates compute resources by core as opposed to by node. That means your tasks might land on nodes that have other jobs running, i.e. they might share the node. If you want to avoid that, make sure to add the following directive to your submission script: #SBATCH --exclusive. This forces your job to use the compute nodes exclusively.
- The maximum walltime is currently set to 4 hours.
- There are two queues (partitions) available: compute and debug. Their usage limits are listed in the table below.
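As a concrete starting point, here is a minimal job script sketch for Teach, assuming an MPI program; the job name, output file, module choices (e.g. openmpi) and program name are placeholders to adapt to your own work:
#!/bin/bash
#SBATCH --nodes=1                  # number of nodes (at most 4 on Teach)
#SBATCH --ntasks-per-node=40       # one task per physical core of a Teach node
#SBATCH --time=01:00:00            # requested walltime (at most 4 hours)
#SBATCH --job-name=my_test         # placeholder job name
#SBATCH --output=my_test_%j.out    # placeholder output file
##SBATCH --exclusive               # uncomment to avoid sharing nodes with other jobs

module load StdEnv                 # default Alliance software environment
module load gcc openmpi            # assumed module names; load what your code needs

srun ./my_program                  # placeholder executable; srun launches the MPI tasks
Submit it from a login node with sbatch jobscript.sh as shown above, and check its status with squeue -u $USER.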
Limits
There are limits to the size and duration of your jobs, the number of jobs you can run, and the number of jobs you can have queued. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.
Usage | Partition | Running jobs | Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs | Min. walltime | Max. walltime |
---|---|---|---|---|---|---|---|
Interactive testing or troubleshooting | debug | 1 | 1 | 1 core | 4 nodes (160 cores) | N/A | 4 hours |
Compute jobs | compute | 1 | 12 | 1 core | 4 nodes (160 cores) | 15 minutes | 4 hours |
Within these limits, jobs may still have to wait in the queue. Although there are no allocations on the Teach cluster, the waiting time still depends on many factors, such as the number of nodes and the walltime, how many other jobs are waiting in the queue, and whether a job can fill an otherwise unused spot in the schedule.