Progressive Approach

From SciNet Users Documentation
Revision as of 18:44, 15 October 2024 by Willis2 (talk | contribs)

Progressive approach to running jobs on Niagara

1) Load modules and/or compile your executable on the login nodes. Choose the proper software stack: NiaEnv (the default) or CCEnv.
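A typical session on a login node might look like the sketch below. The module names, versions and source file are assumptions for illustration; run 'module avail' to see what is actually installed.

```shell
# Select the software stack (NiaEnv is the default on Niagara)
module load NiaEnv/2019b

# Load an assumed compiler and MPI library; versions are illustrative only
module load gcc/8.3.0 openmpi/4.0.1

# Compile a hypothetical MPI program
mpicc -O2 -o my_app my_app.c
```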

2) Make sure the executable runs on a login node, first with 1 CPU, then 2 CPUs, and then with at most 4 CPUs, for no more than 15 minutes. These are only preliminary checks, not production runs.
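These preliminary checks could be run as in the sketch below, assuming a hypothetical executable './my_app' and a small test input:

```shell
# Quick sanity checks on the login node -- keep each run under ~15 minutes
OMP_NUM_THREADS=1 ./my_app small_input   # serial run
OMP_NUM_THREADS=2 ./my_app small_input   # 2 threads
mpirun -np 4 ./my_app small_input        # at most 4 MPI ranks on a login node
```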

3) From this point on, request an interactive session on the debug queue for *ONE* node, for at most 1 hour, and ensure you can scale up to all 40 CPUs without running out of memory or hitting some other single-node-specific hiccup. Pay attention to all messages and notifications on the standard output, and fix any bugs detected up to this stage.

You may use the 'debugjob' utility or 'salloc' to get the interactive job. In principle, any request you would submit with a script to the batch queue can also be requested in an interactive job with salloc. More details here:

https://docs.alliancecan.ca/wiki/Running_jobs#Interactive_jobs
https://slurm.schedmd.com/salloc.html
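For example, a one-node interactive session could be requested either way, as sketched below (the exact partition name and per-node task count follow Niagara's 40-core nodes; adjust to your site's configuration):

```shell
# SciNet convenience wrapper: one node on the debug queue
debugjob 1

# Roughly equivalent request using salloc directly
salloc --nodes=1 --ntasks-per-node=40 --time=1:00:00 --partition=debug
```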

4) Then request an interactive session on the debug queue for *TWO* nodes, for at most 1 hour, and start adjusting your scripts to run over multiple nodes; possibly try hyperthreading.
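The two-node request follows the same pattern; with hyperthreading, Niagara nodes expose 80 logical CPUs instead of 40 (option values below are a sketch, not a prescription):

```shell
# Two nodes on the debug queue
debugjob 2

# Or with salloc, using hyperthreading (80 logical CPUs per node)
salloc --nodes=2 --ntasks-per-node=80 --time=1:00:00 --partition=debug
```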

5) Then focus on the submission script itself in batch mode. Submit it to 1 node on the debug queue for 15 minutes, then to 2 nodes for 30 minutes. Log in to the nodes to check what is going on, and again pay close attention to any error messages or notifications in the logs. (Note: you can always run a Slurm script interactively with 'bash submit.slurm' to test it.)
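A minimal submission script for this stage might look like the following sketch (job name, output pattern, module versions and executable are hypothetical):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=00:15:00
#SBATCH --partition=debug
#SBATCH --job-name=test_run
#SBATCH --output=%x_%j.out

# Assumed modules; match whatever you used to compile
module load NiaEnv/2019b gcc/8.3.0 openmpi/4.0.1

mpirun ./my_app
```

Submit it with 'sbatch submit.slurm', or test it first with 'bash submit.slurm' as noted above.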

6) From this point on, and only then, submit your jobs to the normal batch queue, at first for *ONE* hour and *TWO* nodes only. If everything goes well, start introducing checkpointing strategies into your dataset and workflow, so you can pick up where you left off in case of disruptions during execution.
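One common checkpointing pattern is to have the submission script detect a previous checkpoint file and restart from it. The sketch below assumes a hypothetical checkpoint file name and application flags; the actual mechanism depends entirely on your application.

```shell
# Restart-from-checkpoint pattern (hypothetical file name and flags)
CKPT=checkpoint.dat
if [ -f "$CKPT" ]; then
    echo "resuming from $CKPT"
    # ./my_app --restart "$CKPT"
else
    echo "starting fresh"
    # ./my_app
fi
```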

7) After you have developed a reasonable checkpointing procedure, and only then, slowly scale up your submissions along two dimensions: 4, 10 and 20 nodes, then 4, 8 and 12 hours. Make sure your type of job scales well to a higher number of nodes before asking for more time.

8) An important resource to help you keep track of your jobs, and of how efficiently they are running on the compute nodes, is the my.scinet portal (log in with the same CCDB credentials used to log in to Niagara):

https://my.scinet.utoronto.ca/

9) There is also the recently created Alliance Portal, a very good resource that complements the my.scinet portal well:

https://portal.alliancecan.ca/
* my.scinet focuses on how efficiently the node resources are being used, along with the history of your jobs
* the Alliance Portal focuses on what resources are being used, when and by whom

PLEASE REFRAIN FROM SUBMITTING A JOB FOR THE FIRST TIME WITH THE MAXIMUM NUMBER OF NODES AND THE MAXIMUM AMOUNT OF TIME UNLESS YOU ARE REASONABLY SURE IT WILL WORK; OTHERWISE YOU WILL WASTE A LOT OF RESOURCES.

Back to Quickstart