Rouge
Rouge | |
---|---|
Installed | March 2021 |
Operating System | Linux (CentOS 7.6) |
Number of Nodes | 20 |
Interconnect | InfiniBand (2xEDR) |
RAM/Node | 512 GB |
Cores/Node | 48 |
GPUs/Node | 8 MI50-32GB |
Login/Devel Node | rouge-login01 |
Vendor Compilers | rocm/gcc |
Queue Submission | slurm |
Specifications
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 HPC Fund support program. The cluster consists of 20 x86_64 nodes, each with a single 48-core AMD EPYC 7642 CPU running at 2.3 GHz, 512 GB of RAM, and 8 Radeon Instinct MI50 GPUs.
The nodes are interconnected with 2xHDR100 InfiniBand for internode communication and disk I/O to the SciNet Niagara filesystems. In total, the cluster contains 960 CPU cores and 160 GPUs.
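The totals above, and the share of CPU cores available to each GPU, follow directly from the per-node figures. A small sketch of the arithmetic, with variable names chosen here purely for illustration:

```shell
# Per-node figures taken from the specification table above.
nodes=20
cores_per_node=48
gpus_per_node=8

# Cluster-wide totals and the CPU cores that go with each GPU.
total_cores=$((nodes * cores_per_node))
total_gpus=$((nodes * gpus_per_node))
cores_per_gpu=$((cores_per_node / gpus_per_node))

echo "cores=$total_cores gpus=$total_gpus cores_per_gpu=$cores_per_gpu"
```

The per-GPU figure of 6 cores is the reason single-GPU jobs on this cluster should not request more than 6 CPU threads' worth of cores.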
Access and support requests should be sent to support@scinet.utoronto.ca.
Getting started on Rouge
The Rouge login node, rouge-login01, can be accessed via the Niagara cluster:
    ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca
    ssh -Y rouge-login01
Storage
The filesystem for Rouge is currently shared with the Niagara cluster. See Niagara Storage for more details.
Loading software modules
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.
Other than essentials, all installed software is made available using module commands. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be found on the modules page.
Common module subcommands are:
- module load <module-name>: load the default version of a particular software package.
- module load <module-name>/<module-version>: load a specific version of a particular software package.
- module purge: unload all currently loaded modules.
- module spider (or module spider <module-name>): list available software packages.
- module avail: list loadable software packages.
- module list: list loaded modules.
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.
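As an illustration of how such a root variable might be used when compiling your own code, the sketch below builds include and library flags from it. The name SCINET_OPENMPI_ROOT is an assumption (the pattern applied to an openmpi module), and the fallback path is a stand-in so the snippet runs off-cluster; check `module show` on Rouge for the actual variable a module sets.

```shell
# Assumption: an openmpi module exports SCINET_OPENMPI_ROOT.
# The default below is only a placeholder used when the variable is unset.
: "${SCINET_OPENMPI_ROOT:=/opt/example/openmpi}"

# Point the preprocessor and linker at the module's directories.
CPPFLAGS="-I${SCINET_OPENMPI_ROOT}/include"
LDFLAGS="-L${SCINET_OPENMPI_ROOT}/lib"

echo "$CPPFLAGS $LDFLAGS"
```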
There are handy abbreviations for the module commands: ml is the same as module list, and ml <module-name> is the same as module load <module-name>.
Available compilers and interpreters
- The ROCm module has to be loaded first for GPU software.
- To compile MPI code, you must additionally load an openmpi module.
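The load order described above could look like the following sketch. The module names `rocm` and `openmpi` without version suffixes are assumptions; use `module spider` to find the names and versions actually installed. The guard only keeps the snippet harmless when run somewhere without a module system.

```shell
# Hypothetical helper: load the GPU toolchain in the order described above.
load_gpu_toolchain() {
    if command -v module >/dev/null 2>&1; then
        module purge
        module load rocm       # assumed module name; load this first
        module load openmpi    # assumed module name; only needed for MPI code
        module list
    else
        echo "module command not found; run this on rouge-login01"
    fi
}

load_gpu_toolchain
```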
ROCm
Software
Singularity Containers
    /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif
    /scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif
    /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif
    /scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif
GROMACS
The HIP version of GROMACS 2020.3, which performs better than the OpenCL version, is provided by AMD in a container. For now, we suggest using a single GPU for all simulations. Job example:
    #!/bin/bash
    #SBATCH --time=1:00:00
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    export SINGULARITY_HOME=$SLURM_SUBMIT_DIR
    singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......
    # setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job
    # if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override
NAMD
The HIP version of NAMD (3.0a) is provided by AMD in a container. For now, we suggest using a single GPU for all simulations. Job example:
    #!/bin/bash
    #SBATCH --time=1:00:00
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    export SINGULARITY_HOME=$SLURM_SUBMIT_DIR
    singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd
    # do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.
PyTorch
Install PyTorch into a Python virtual environment:
    module load python gcc
    mkdir -p ~/.virtualenvs
    virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm
    source ~/.virtualenvs/pytorch-rocm/bin/activate
    pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html
    pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'
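After installation, a quick check along the following lines can confirm that the package imports and whether a GPU is visible. This is a sketch, not part of the official instructions: the helper name `check_torch_gpu` is made up here, and it should be run with the virtual environment activated on a node that has GPUs. ROCm builds of PyTorch expose AMD GPUs through the regular `torch.cuda` interface, so `torch.cuda.is_available()` is the call to use.

```shell
# Hypothetical helper: report the installed torch version and GPU visibility.
check_torch_gpu() {
    if python3 -c 'import torch' 2>/dev/null; then
        python3 -c 'import torch; print("torch", torch.__version__, "GPU visible:", torch.cuda.is_available())'
    else
        echo "torch is not importable in this environment"
        return 1
    fi
}

# Do not abort the surrounding shell if the check fails.
check_torch_gpu || true
```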
Run a PyTorch job on a single GPU:
    #!/bin/bash
    #SBATCH --time=1:00:00
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    module load python gcc
    source ~/.virtualenvs/pytorch-rocm/bin/activate
    python code.py