Niagara Quickstart
A progressive approach to testing and running jobs on Niagara
1) Choose the proper software stack: NiaEnv (default) or CCEnv. Load modules and/or compile your executable on the login nodes.
NiaEnv and CCEnv
On Niagara, there are two software stacks:
A Niagara software stack tuned and compiled for this machine. This stack is loaded by default, but if it is not, it can be reloaded with
module load NiaEnv
The same software stack available on Compute Canada's General Purpose clusters Graham and Cedar, compiled (for now) for a previous generation of CPUs:
module load CCEnv
Or, if you want the same default modules loaded as on Cedar and Graham, then do
module load CCEnv
module load StdEnv
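As a quick check of which stack is active, you can list the loaded modules; the session below is illustrative (prompt and module versions may differ):
nia-login07:~$ module load CCEnv
nia-login07:~$ module load StdEnv
nia-login07:~$ module list
nia-login07:~$ module load NiaEnv    # return to the Niagara stack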
2) Determine whether you will use existing software on Niagara or compile your own.
Using existing software
Other than essentials, all installed software is made available using module commands. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be found on the modules page.
Common module subcommands are listed below; an example session follows the list.
- module load <module-name>: load the default version of a particular software package.
- module load <module-name>/<module-version>: load a specific version of a particular software package.
- module purge: unload all currently loaded modules.
- module spider (or module spider <module-name>): list available software packages.
- module avail: list loadable software packages.
- module list: list loaded modules.
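For example, a typical sequence for finding and loading a package might look as follows (the package name and version are illustrative; use module spider to see what is actually available):
nia-login07:~$ module spider gsl
nia-login07:~$ module load gsl/2.4
nia-login07:~$ module list
nia-login07:~$ module purge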
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the package, such as its include and lib subdirectories.
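For instance, assuming a hypothetical fftw module is loaded, its headers and libraries could be referenced as follows (the module name, library name and flags are illustrative):
nia-login07:~$ module load fftw
nia-login07:~$ icc -c -I$SCINET_FFTW_ROOT/include code.c
nia-login07:~$ icc -o code code.o -L$SCINET_FFTW_ROOT/lib -lfftw3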
There are handy abbreviations for the module commands: ml is the same as module list, and ml <module-name> is the same as module load <module-name>.
Compiling on Niagara: Example
Suppose one wants to compile an application from two C source files, appl.c and module.c, which use the Math Kernel Library (MKL). This is an example of how this would be done:
nia-login07:~$ module list
Currently Loaded Modules:
  1) NiaEnv/2018a (S)
  Where:
   S:  Module is Sticky, requires --force to unload or purge

nia-login07:~$ module load intel/2018.2

nia-login07:~$ ls
appl.c  module.c

nia-login07:~$ icc -c -O3 -xHost -o appl.o appl.c
nia-login07:~$ icc -c -O3 -xHost -o module.o module.c
nia-login07:~$ icc -o appl module.o appl.o -mkl

nia-login07:~$ ./appl
Note:
- The optimization flags -O3 -xHost allow the Intel compiler to use instructions specific to the CPU architecture that is present (instead of more generic instructions that work on any x86_64 CPU).
- Linking with the Intel Math Kernel Library (MKL) is easy when using the Intel compiler; it just requires the -mkl flag.
- If compiling with gcc, the optimization flags would be -O3 -march=native. To find the right way to link with the MKL, it is suggested to use the MKL link line advisor.
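For reference, a gcc build of the same example might look as follows. The MKL link line shown is one possible suggestion of the link line advisor (sequential, LP64, dynamic linking) and assumes MKLROOT is set by a loaded MKL-providing module, so verify it for your own case:
nia-login07:~$ module load gcc
nia-login07:~$ gcc -c -O3 -march=native -o appl.o appl.c
nia-login07:~$ gcc -c -O3 -march=native -o module.o module.c
nia-login07:~$ gcc -o appl module.o appl.o -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl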
3) Make sure the executable runs on a login node, first on 1 CPU, then on 2, and then on at most 4 CPUs, and for no more than 15 minutes. These are only preliminary checks, not production runs.
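For an OpenMP code, for example, such a progressive check could look like this (OMP_NUM_THREADS is an assumption; use whatever mechanism controls parallelism in your application):
nia-login07:~$ OMP_NUM_THREADS=1 ./appl
nia-login07:~$ OMP_NUM_THREADS=2 ./appl
nia-login07:~$ OMP_NUM_THREADS=4 ./appl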
4) From this point on, request an interactive session on the debug queue for *ONE* node, for at most 1 hour, and ensure you can scale up to all 40 CPUs of the node without running out of memory or hitting some other single-node hiccup. Pay attention to all messages and notifications on the standard output, and fix any bugs detected at this stage.
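On Niagara such an interactive session can be requested with the debugjob command. A sketch (the compute-node prompt is illustrative, and the thread count again assumes an OpenMP code):
nia-login07:~$ debugjob 1
nia0001:~$ OMP_NUM_THREADS=40 ./appl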
5) Then request an interactive session on the debug queue for *TWO* nodes, for at most 1 hour, and start adjusting your scripts to run over multiple nodes; possibly try hyperthreading.
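A sketch of such a two-node session, assuming an MPI application launched with srun (the process count is illustrative; with hyperthreading you could try up to 80 processes or threads per node):
nia-login07:~$ debugjob 2
nia0001:~$ srun -n 80 ./appl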
6) Then focus on the submission script itself in batch mode: submit it to 1 node on the debug queue for 15 minutes, then to 2 nodes for 30 minutes. Log in to the nodes to check what is going on and, again, pay close attention to any error messages or notifications in the logs.
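A minimal sketch of such a submission script (the file name, module, and srun launch are assumptions to be adapted to your own application):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=00:15:00
#SBATCH --partition=debug
#SBATCH --job-name=appl_test

module load intel/2018.2   # same modules as used for compiling
srun ./appl                # launch over all allocated CPUs
It would be submitted and monitored with
nia-login07:~$ sbatch appl_job.sh
nia-login07:~$ squeue -u $USER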
7) From this point on, and only then, submit your jobs to the regular batch queue, at first for *ONE* hour and on *TWO* nodes only. If everything goes well, start to introduce a checkpointing strategy into your dataset and workflow, so you can pick up where you left off in case of disruptions during execution.
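Checkpointing is application-specific; as a generic sketch, the job script could resume from the most recent saved state along these lines (the checkpoint file name and the --restart flag are hypothetical):
# hypothetical restart logic inside the job script
if [ -f checkpoint.dat ]; then
    ./appl --restart checkpoint.dat    # resume from the last saved state
else
    ./appl                             # fresh start, saving checkpoint.dat periodically
fi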
8) After you have developed a reasonable checkpointing procedure, and only then, slowly scale up your submissions in two dimensions: first to 4, 10 and 20 nodes, then to 4, 8 and 12 hours. Make sure your type of job scales well to a larger number of nodes before asking for more time.
PLEASE REFRAIN FROM SUBMITTING A FIRST-TIME JOB THAT ALREADY ASKS FOR THE MAXIMUM NUMBER OF NODES AND THE MAXIMUM AMOUNT OF TIME UNLESS YOU ARE REASONABLY SURE IT WILL WORK; OTHERWISE YOU WILL BE WASTING A LOT OF RESOURCES.