Teach

From SciNet Users Documentation

This page describes the usage of the new Teach cluster, installed in Feb 2025. It is currently still somewhat in a beta phase.

Teach Cluster
Installed: Feb 2025 (originally Mar 2020)
Operating System: Linux (Rocky 9.5)
Number of Nodes: 8
Interconnect: InfiniBand (EDR)
RAM/Node: 188 GiB (202 GB)
Cores/Node: 40 (80 hyperthreads)
Login/Devel Node: teach-login01
Vendor Compilers: gcc
Queue Submission: Slurm

Teaching Cluster

SciNet has assembled some older compute hardware into a small cluster provided primarily for teaching purposes. It is configured similarly to the upcoming SciNet production system Trillium, but it uses hardware repurposed from Trillium's predecessor, Niagara. This system should not be used for production work, as the queuing policies are designed to provide fast job turnover and to limit the amount of resources any one person can use at a time. Questions about its use or problems should be sent to support@scinet.utoronto.ca.

This Teach cluster is set up differently from its predecessor; see below for the main changes.


Specifications

This cluster currently consists of 8 repurposed x86_64 nodes, each with 40 cores (from two 20-core Intel Cascade Lake CPUs) running at 2.5 GHz and with 188 GiB of RAM per node. The nodes support hyperthreading, so every physical core presents itself as two logical cores. The nodes are interconnected with 1:1 non-blocking EDR InfiniBand for MPI communications, while disk I/O goes to a separate view of the VAST file system. In total, this cluster contains 320 cores, but there are plans to expand it if demand warrants it.

Login/Devel Node

Teach runs Rocky Linux 9. You will need to be somewhat familiar with Linux systems to work on Teach. If you are not, it will be worth your time to review our Introduction to Linux Shell class.

As with all SciNet and Alliance (formerly Compute Canada) systems, access to Teach is done via SSH (secure shell) only. Open a terminal window (e.g. using PuTTY or MobaXTerm on Windows), and type

ssh -Y USERNAME@teach.scinet.utoronto.ca

This will bring you directly to the command line of teach-login01 or teach-login02, which are the gateway/devel nodes for this cluster. On these nodes, you can compile, do short tests, and submit your jobs to the queue.

The first time you log in to the Teach cluster, please make sure to check that the ssh key fingerprint of the login node matches one of the published fingerprints (see Teach_fingerprints).


The login nodes are shared between students of a number of different courses. Use these nodes to develop and compile code, to run short tests, and to submit computations to the scheduler (see below).

Note that access to the Teach cluster is restricted to temporary accounts whose names consist of the prefix lcl_uot, followed by the course code, the letter s, and a number. Passwords for these accounts can be changed on the SciNet user portal. On the same site, users can upload a public ssh key if they want to connect using ssh keys.
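
For example, a minimal way to set up key-based access from your own computer might look as follows (assuming an OpenSSH client; the key file name is just an illustration):

ssh-keygen -t ed25519 -f ~/.ssh/teach_ed25519

After uploading the contents of the public key file ~/.ssh/teach_ed25519.pub on the SciNet user portal, you can then log in with

ssh -i ~/.ssh/teach_ed25519 USERNAME@teach.scinet.utoronto.ca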

Software Modules

Other than essentials, all installed software is made available using module commands. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be found on the modules page.

The Teach cluster makes the same modules available as on the General Purpose clusters of the Digital Research Alliance of Canada, with one caveat. On Teach, by default, only the "gentoo" module is loaded, which provides basic OS-level functionality.

Common module subcommands are:

  • module load <module-name>: load the default version of a particular software.
  • module load <module-name>/<module-version>: load a specific version of a particular software.
  • module purge: unload all currently loaded modules.
  • module spider (or module spider <module-name>): list available software packages.
  • module avail: list loadable software packages.
  • module list: list loaded modules.

For example, to make the GNU compilers (gcc, g++ and gfortran) available, you should type

module load gcc

while the Intel compilers (icc, icpc and ifort) can be loaded by

module load intel

To get the default modules that are loaded on the General Purpose clusters, you can load the "StdEnv" module.
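
For instance, to switch to that default software environment, you would type

module load StdEnv

after which module list will show the default modules of the General Purpose clusters.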

Along with modifying common environment variables, such as the PATH, these modules also create an EBROOT<MODULENAME> environment variable, which can be used to access commonly needed directories of that software, such as its include and lib subdirectories.
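
For example, to see which EBROOT variables are defined after loading a module (the exact variable names depend on which modules you have loaded):

module load gcc
env | grep ^EBROOT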

There are handy abbreviations for the module commands. ml is the same as module list, and ml <module-name> is the same as module load <module-name>.

A list of available software modules can be found on this page.

There are a few additional modules available as well, and more can be made available at the request of course instructors. Currently, the only additional modules are:

rarray/2.8.0 - Reference-counted multi-dimensional arrays for C++
catch2/3.3.1 - A C++ test framework for unit tests, TDD and BDD using C++14 and later.
misopy/0.5.2 - A probabilistic framework to analyze RNA-Seq data.
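
As an illustration, a minimal way to compile a C++ program against the rarray headers might look like this (myprog.cpp is a placeholder for your own source file, and the EBROOTRARRAY variable assumes the module follows the EBROOT convention described above):

module load gcc rarray
g++ -std=c++17 -O2 -I$EBROOTRARRAY/include myprog.cpp -o myprog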

Interactive jobs

For an interactive session on a compute node of the Teach cluster that gives access to dedicated (non-shared) resources, use the 'debugjob' command:

teach-login01:~$ debugjob -n C

where C is the number of cores. An interactive session lasts up to four hours when using at most one node (C<=40), and only 60 minutes when using four nodes (i.e., 120<C<=160), which is the maximum number of nodes allowed for an interactive session with debugjob.

For a short interactive session on one or more dedicated compute nodes of the Teach cluster, use the 'debugjob' command as follows:

teach-login01:~$ debugjob N

where N is the number of nodes. On the Teach cluster, this is equivalent to debugjob -n 40*N. N must be a positive integer and can be at most 4.

If no arguments are given to debugjob, it allocates a single core on a Teach compute node.

There are limits on the resources you can get with a debugjob and on how long you can keep them: no debugjob can run longer than four hours or use more than 160 cores, and each user can run only one at a time. For longer computations, jobs must be submitted to the scheduler.
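
For example, a short interactive test could look as follows (./mycode is a placeholder for your own MPI program, and the name of the compute node you land on will vary):

teach-login01:~$ debugjob -n 8
teach01:~$ module load StdEnv
teach01:~$ mpirun -np 8 ./mycode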

Submit a Job

Teach uses SLURM as its job scheduler. More advanced details of how to interact with the scheduler can be found on the Slurm page.

You submit jobs from a login node by passing a script to the sbatch command:

teach-login01:~$ sbatch jobscript.sh

This puts the job in the queue. It will run on the compute nodes in due course.

Note:

  • Each Teach cluster node has two CPUs with 20 cores each, for a total of 40 cores per node. The nodes have hyperthreading enabled, so it will seem like you're getting twice as many cores, but those are logical cores, not physical ones.
  • Make sure to adjust the flags --ntasks-per-node or --ntasks, together with --nodes, accordingly when using the examples found on the Slurm page.
  • The current Slurm configuration of the Teach cluster allocates compute resources by core rather than by node. That means your tasks might land on nodes that have other jobs running, i.e., they might share the node. If you want to avoid that, add the directive #SBATCH --exclusive to your submission script; this forces your job to use its compute nodes exclusively. An example script is given below this list.
  • The maximum wall time is currently set to 4 hours.
  • There are 2 queues (partitions) available: the compute queue and the debug queue. Their usage limits are listed in the table below.
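
As an illustration, a minimal job script for this cluster might look like the following (mycode is a placeholder for your own MPI program; adjust the number of nodes, tasks per node, walltime and modules to your own case):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00
#SBATCH --job-name=myjob
# To get the node(s) all to yourself, uncomment the next line:
##SBATCH --exclusive

module load StdEnv

mpirun ./mycode

Saving this as jobscript.sh and running sbatch jobscript.sh as shown above will queue a one-hour, 40-task job on a single node.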

Limits

There are limits to the size and duration of your jobs, the number of jobs you can run, and the number of jobs you can have queued. It also matters in which 'partition' the job runs; 'partitions' are SLURM-speak for use cases. You specify the partition with the -p parameter to sbatch or salloc, but if you do not specify one, your job will run in the compute partition, which is the most common case.
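
For example, to send a (short) job explicitly to the debug partition rather than the default compute partition, you could submit it with

teach-login01:~$ sbatch -p debug jobscript.sh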

Usage                                  | Partition | Running jobs | Submitted jobs (incl. running) | Min. size of jobs | Max. size of jobs   | Min. walltime | Max. walltime
Interactive testing or troubleshooting | debug     | 1            | 1                              | 1 core            | 4 nodes (160 cores) | N/A           | 4 hours
Compute jobs                           | compute   | 1            | 12                             | 1 core            | 4 nodes (160 cores) | 15 minutes    | 4 hours

Within these limits, jobs may still have to wait in the queue. Although there are no allocations on the teach cluster, the waiting time still depends on many factors, such as the number of nodes and the wall time, how many other jobs are waiting in the queue, and whether a job can fill an otherwise unused spot in the schedule.
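
To check on your jobs while they are waiting or running, and to cancel one if needed, the standard Slurm commands work; for example (JOBID is a placeholder for the job number reported by sbatch):

teach-login01:~$ squeue -u $USER
teach-login01:~$ scancel JOBID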

Main changes from Teach's predecessor

Although the cluster is once again called Teach and you connect to teach.scinet.utoronto.ca as before, the system is set up differently from the previous Teach cluster in the following ways:

  • There are now 2 dedicated login nodes, teach-login01 and teach-login02.
  • The ssh fingerprints for these login nodes can be found on Teach_fingerprints.
  • The compute nodes have 40 cores with hyperthreading enabled. As before, you can request jobs by number of cores, but because of the hyperthreading it will seem as if you got twice the number of cores you asked for.
  • Only temporary lcl_uot.... accounts can log in.
  • Only the home directories of those accounts are mounted.
  • In particular, the file systems from the other SciNet compute clusters (Niagara, Mist, Trillium, ...) are not and will not be mounted. You'll need to copy over any files that you need to use on the Teach cluster (see the example after this list).
  • There is no $SCRATCH. You can do all your work in $HOME, which is writable from the compute nodes.
  • The software stack is the one supplied by the Alliance; there is no need to load 'CCEnv' to get it.
  • But as before, if you're missing a module, we can still install it for you.
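
For example, to copy files from your own computer (or from another system) into your Teach home directory, you could use scp or rsync from that other machine (mydata.tar.gz and myproject/ are placeholders for your own files):

scp mydata.tar.gz USERNAME@teach.scinet.utoronto.ca:
rsync -av myproject/ USERNAME@teach.scinet.utoronto.ca:myproject/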