Quickstart new

From SciNet Users Documentation

A progressive approach to testing and running jobs on Niagara

1) Choose the proper software stack: NiaEnv (the default) or CCEnv. Load modules and/or compile your executable on the login nodes.

NiaEnv and CCEnv

On Niagara, two separate software stacks are available:

  1. A Niagara software stack tuned and compiled specifically for this machine. This stack is loaded by default; if it has been unloaded, it can be restored with

    module load NiaEnv
  2. The same software stack available on Compute Canada's General Purpose clusters Graham and Cedar, compiled (for now) for a previous generation of CPUs:

    module load CCEnv

    Or, if you want the same default modules loaded as on Cedar and Graham, then do

    module load CCEnv

    module load StdEnv

Loading software modules

You have two options for running code on Niagara: use existing software, or compile your own. This section focuses on the former.

Other than essentials, all installed software is made available using module commands. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be found on the modules page.

Common module subcommands are:

  • module load <module-name>: load the default version of a software package.
  • module load <module-name>/<module-version>: load a specific version of a software package.
  • module purge: unload all currently loaded modules.
  • module spider (or module spider <module-name>): list all available software packages, including those that require other modules to be loaded first.
  • module avail: list the software packages that can be loaded directly.
  • module list: list the currently loaded modules.

Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the package, such as its include and lib subdirectories.

There are handy abbreviations for the module commands. ml is the same as module list, and ml <module-name> is the same as module load <module-name>.
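As an illustration, a typical session on a login node might combine these commands as in the sketch below (the package names and versions are examples only; use module spider to see what is actually installed):

```shell
# See which versions of a package exist (gcc is just an example name)
module spider gcc

# Load a compiler and an MPI library; the versions available on Niagara may differ
module load gcc/8.3.0
module load openmpi/3.1.3

# Verify what is loaded ('ml' on its own does the same)
module list

# Return to a clean slate before trying a different combination
module purge
```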

2) Make sure the executable runs on a login node, first on 1 CPU, then on 2, and finally on at most 4 CPUs, each time for no more than 15 minutes. These are only preliminary checks, not production runs.
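For a threaded code, these preliminary checks could look like the following sketch (my_app and its input are placeholders for your own program; the timeout command simply kills the run if it exceeds the 15-minute courtesy limit):

```shell
# Placeholder executable and input; substitute your own.
export OMP_NUM_THREADS=1              # first a single thread
timeout 15m ./my_app small_test_input

export OMP_NUM_THREADS=2              # then two threads
timeout 15m ./my_app small_test_input

export OMP_NUM_THREADS=4              # at most 4 CPUs on a login node
timeout 15m ./my_app small_test_input
```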

3) From this point on, request an interactive session on the debug queue for *ONE* node, for at most 1 hour, and ensure you can scale up to all 40 CPUs without running out of memory or hitting some other single-node hiccup. Pay attention to all messages and warnings on the standard output, and fix any bugs detected up to this stage.
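On a Slurm system such as Niagara, a one-node interactive session can be requested roughly as follows (the partition name and resource flags are illustrative; check the Niagara scheduler documentation for the exact way to reach the debug queue):

```shell
# Request 1 node with all 40 cores for at most 1 hour on the debug queue
# (partition name is an assumption; adjust to your system's configuration).
salloc --nodes=1 --ntasks-per-node=40 --time=1:00:00 --partition=debug

# Inside the session, scale the thread count up towards the full node:
export OMP_NUM_THREADS=40
./my_app test_input
```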

4) Then request an interactive session on the debug queue for *TWO* nodes, for at most 1 hour, and start adjusting your scripts to run over multiple nodes; you may also want to try hyperthreading.
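A two-node interactive session follows the same pattern; before launching the real run it is worth verifying that both nodes answer (flags and names are again illustrative):

```shell
# Request 2 nodes for at most 1 hour on the debug queue (names illustrative).
salloc --nodes=2 --ntasks-per-node=40 --time=1:00:00 --partition=debug

# Confirm that both nodes respond: this should print two different hostnames.
srun --ntasks=2 --ntasks-per-node=1 hostname

# Then launch one MPI rank per core across both nodes:
srun ./my_mpi_app test_input
```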

5) Then focus on the submission script itself in batch mode: submit it to 1 node on the debug queue for 15 minutes, then to 2 nodes for 30 minutes. Log in to the nodes to check what is going on and, again, pay close attention to any error messages or warnings in the logs.
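A minimal batch script for the first of these test submissions might look like this sketch (job name, module names, and the program itself are placeholders to adapt):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=00:15:00
#SBATCH --partition=debug
#SBATCH --job-name=test_run
#SBATCH --output=test_run_%j.out

# Recreate the environment the code was built with (example modules)
module load gcc openmpi

srun ./my_mpi_app test_input
```

Submit it with sbatch test_run.sh; for the second test, change --nodes to 2 and --time to 00:30:00.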

6) From this point on, and only then, submit your jobs to the normal batch queue, at first for *ONE* hour and on *TWO* nodes only. If everything goes well, start to introduce checkpointing into your dataset and workflow, so that you can pick up where you left off in case of disruptions during execution.
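One simple checkpointing pattern, assuming your program can write a restart file and resume from it (my_app, checkpoint.dat, and the --restart flag are all hypothetical), is to let the job script decide whether to start fresh or resume:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=1:00:00
#SBATCH --output=run_%j.out

# If a checkpoint from a previous (possibly interrupted) run exists,
# resume from it; otherwise start from the original input.
if [ -f checkpoint.dat ]; then
    srun ./my_app --restart checkpoint.dat
else
    srun ./my_app input.dat
fi
```

With this in place, a job killed by a disruption can simply be resubmitted and will continue from the last checkpoint.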

7) After you have developed a reasonable checkpoint procedure, and only then, slowly scale up your submissions along two dimensions: first to 4, 10 and 20 nodes, then to 4, 8 and 12 hours. Make sure your type of job scales well to a higher number of nodes before asking for more time.

PLEASE REFRAIN FROM SUBMITTING A FIRST-TIME JOB THAT ALREADY ASKS FOR THE MAXIMUM NUMBER OF NODES AND THE MAXIMUM AMOUNT OF TIME: UNLESS YOU ARE REASONABLY SURE IT WILL WORK, YOU WILL BE WASTING A LOT OF RESOURCES.