Progressive approach to running jobs on Niagara
1) Load modules and/or compile your executable on the login nodes. Choose the proper software stack: NiaEnv (default) or CCEnv.
2) Make sure the executable runs on a login node, first on 1 CPU, then on 2 CPUs, and then on 4 CPUs at most, for no more than 15 minutes. These are only preliminary checks, not production runs.
3) From this point on, request an interactive session on the debug queue for *ONE* node, for at most 1 hour, and ensure you can scale up to all 40 CPUs without running out of memory or hitting some other single-node hiccup. Pay attention to all messages and notifications on the standard output, and fix any bugs detected up to this stage.
4) Then request an interactive session on the debug queue for *TWO* nodes, for at most 1 hour, and start adjusting your scripts to run over multiple nodes. Possibly try hyperthreading.
5) Then focus on the submission script itself in batch mode: submit it to the debug queue for 1 node for 15 minutes, then for 2 nodes for 30 minutes. Log in to the nodes to check what is going on and, again, pay close attention to any error messages or notifications in the logs.
6) From this point on, and only then, submit your jobs to the normal batch queue, for *ONE* hour at first and for *TWO* nodes only. If everything goes well, start introducing checkpoint strategies into your dataset and workflow, so you can resume from the last checkpoint if the execution is disrupted.
7) After you have developed a reasonable checkpoint procedure, and only then, slowly scale up your submissions in 2 dimensions: first 4, 10 and 20 nodes, then 4, 8 and 12 hours. Be sure your type of job scales well over a higher number of nodes before asking for more time.
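As a rough illustration of steps 3 to 5, the sketch below writes a small submission script for the debug queue and prints the commands one might use with it, without submitting anything. It assumes the Slurm scheduler and an MPI-style program that runs one task per CPU. The partition name debug is inferred from "debug queue" above, debug_job.sh and ./my_app are hypothetical names, and the module line is left as a comment because the modules to load depend on your own code. The salloc, sbatch and srun options used are standard Slurm options.

```shell
#!/bin/bash
# Sketch only: writes a hypothetical debug submission script; nothing is submitted.
cat > debug_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=debug-test      # hypothetical job name
#SBATCH --partition=debug          # assumed name of the debug queue
#SBATCH --nodes=1                  # step 5: 1 node first, then 2
#SBATCH --ntasks-per-node=40       # use all 40 CPUs of a node
#SBATCH --time=00:15:00            # 15 minutes for 1 node, 30 for 2
#SBATCH --output=debug-test-%j.out # %j expands to the job ID

# Load the same modules used when compiling on the login node (step 1).
# module load ...

srun ./my_app                      # hypothetical executable
EOF

# Interactive sessions of steps 3 and 4 (printed here, not executed):
echo "salloc --partition=debug --nodes=1 --time=01:00:00"
echo "salloc --partition=debug --nodes=2 --time=01:00:00"

# Batch submissions of step 5; command-line options override the #SBATCH lines:
echo "sbatch debug_job.sh"
echo "sbatch --nodes=2 --time=00:30:00 debug_job.sh"
```

Passing --nodes and --time on the sbatch command line keeps a single script for both the 1-node and the 2-node test, so the two runs differ only in the resources requested.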
PLEASE REFRAIN FROM SUBMITTING A FIRST JOB THAT ALREADY ASKS FOR THE MAX NUMBER OF NODES AND THE MAX AMOUNT OF TIME IF YOU ARE NOT REASONABLY SURE IT WILL WORK, OR YOU WILL BE WASTING A LOT OF RESOURCES.
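One cautious way to follow this advice is to write down the scale-up ladder of step 7 before submitting anything. The dry-run sketch below only prints the sbatch commands in the order they might be tried; it does not submit jobs. The script name job.sh and the function plan_scale_up are hypothetical, and two choices are assumptions about how to read step 7: the walltime is held at 1 hour while the node count grows, and the node count is held at 20 while the walltime grows.

```shell
#!/bin/bash
# Dry-run sketch: print (do not submit) the scale-up ladder of step 7.
plan_scale_up() {
    local script="$1"   # hypothetical submission script, e.g. job.sh
    local nodes hours

    # First dimension: grow the node count, keeping the walltime at 1 hour (assumption).
    for nodes in 4 10 20; do
        echo "sbatch --nodes=${nodes} --time=01:00:00 ${script}"
    done

    # Second dimension: once node scaling looks good, grow the walltime,
    # keeping the node count at 20 (assumption).
    for hours in 4 8 12; do
        printf 'sbatch --nodes=20 --time=%02d:00:00 %s\n' "${hours}" "${script}"
    done
}

plan_scale_up job.sh
```

Each printed command would be submitted by hand only after the previous rung has finished cleanly and its logs have been checked.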