https://docs.scinet.utoronto.ca/api.php?action=feedcontributions&user=Ejspence&feedformat=atom
SciNet Users Documentation - User contributions [en]
2024-03-28T14:13:40Z
User contributions
MediaWiki 1.35.12
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=5421
Main Page
2024-01-29T12:57:06Z
<p>Ejspence: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|-<br />
|{{Up |Balam|Balam}}<br />
|{{Up |CCEnv|Using_modules}}<br />
|}<br />
<br />
'''Mon January 29, 07:35 (EST):''' No access to Niagara login nodes. We are investigating. Use the Mist login node to access SciNet systems.<br />
<br />
'''Wed January 24, 15:20 (EST):''' maintenance on rouge-login01 <br />
<br />
'''Wed January 24, 14:55 (EST):''' Rebooting rouge-login01 <br />
<br />
'''Tue January 23, 10:25 am (EST):''' Mist-login01 maintenance done <br />
<br />
'''Tue January 23, 10:10 am (EST):''' Rebooting Mist-login01 to deploy new image<br />
<br />
'''Mon January 22, 21:00 (EST):''' HPSS performance for hsi & htar clients is back to normal.<br />
<br />
'''Sat January 20, 11:50 am (EST):''' HPSS hsi/htar/VFS jobs will remain in the PD (pending) state in the queue over the weekend, so that we can work on archive02/vfs02 on Monday and try to improve transfer performance. In the meantime you may use Globus (computecanada#hpss) if your workflow is suitable. <br />
<br />
'''Sun January 14, 13:20 (EST):''' The ongoing HPSS jobs from Friday finished earlier, so we restarted HPSS sooner and released the PD jobs in the queue. <br />
<br />
'''Fri January 12, 10:40 am (EST):''' We have applied some tweaks to the HPSS configuration to improve performance, but they won't take effect until we restart the services, which is scheduled for Monday morning. If over the weekend we notice that there are no HPSS jobs running in the queue, we may restart HPSS sooner. <br />
<br />
'''Tue January 09, 9:10 am (EST):''' Remaining cvmfs issues cleared.<br />
<br />
'''Tue January 09, 8:00 am (EST):''' We're investigating remaining issues with cvmfs access on login nodes.<br />
<br />
'''Mon January 08, 21:50 (EST):''' File systems are back to normal. Please resubmit your jobs. <br />
<br />
'''Mon January 08, 9:10 pm (EST):''' We had a severe deadlock, and some disk volumes went down. The file systems are being recovered now. It could take another hour.<br />
<br />
'''Mon January 08, 7:20 pm (EST):''' We seem to have a problem with the file system, and are investigating.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=5262
Main Page
2023-12-11T12:52:05Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Mon Dec 11 7:51:00 EST 2023:''' Niagara's login nodes are being overwhelmed. We are investigating.<br />
<br />
'''Thu Dec 7 10:01:24 EST 2023:''' Niagara's scheduler rebooting for security patches.<br />
<br />
'''Wed Dec 6 13:06:46 EST 2023:''' The transition of endpoint computecanada#niagara from Globus GCSv4 to GCSv5 is complete. computecanada#niagara-GCSv4 has been deactivated.<br />
<br />
'''Mon Dec 4 16:35:07 EST 2023:''' Endpoint computecanada#niagara has now been upgraded to Globus GCSv5. The old endpoint is still available as computecanada#niagara-GCSv4 on nia-datamover2, only until Wednesday, at which time we'll disable it as well.<br />
<br />
'''Mon Dec 4 11:54:49 EST 2023:''' The nia-datamover1 node will be offline this Monday afternoon for the Globus GCSv5 upgrade. Endpoint computecanada#niagara-GCSv4 will still be available via nia-datamover2.<br />
<br />
'''Tue Nov 28 16:29:14 EST 2023:''' The computecanada#hpss Globus endpoint is now running GCSv5. We'll find a window of opportunity next week to upgrade computecanada#niagara to GCSv5 as well.<br />
<br />
'''Tue Nov 28 14:20:30 EST 2023:''' The computecanada#hpss Globus endpoint will be offline for the next few hours for the GCSv5 upgrade.<br />
<br />
'''Fri Nov 10, 2023, 18:00 EDT:''' The HPSS upgrade is finished. We didn't have time to update Globus to GCSv5, so we'll find a window of opportunity to do this next week. <br />
<br />
Please be advised that starting this <B>Friday morning, Nov/10, we'll be upgrading the HPSS system from version 8.3 to 9.3 and the HPSS Globus server from GCSv4 to GCSv5.</B> If everything goes well, we expect to be back online by the end of the day. <br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=5259
Main Page
2023-12-11T12:51:02Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Dec 7 10:01:24 EST 2023:''' Niagara's scheduler rebooting for security patches.<br />
<br />
'''Wed Dec 6 13:06:46 EST 2023:''' The transition of endpoint computecanada#niagara from Globus GCSv4 to GCSv5 is complete. computecanada#niagara-GCSv4 has been deactivated.<br />
<br />
'''Mon Dec 4 16:35:07 EST 2023:''' Endpoint computecanada#niagara has now been upgraded to Globus GCSv5. The old endpoint is still available as computecanada#niagara-GCSv4 on nia-datamover2, only until Wednesday, at which time we'll disable it as well.<br />
<br />
'''Mon Dec 4 11:54:49 EST 2023:''' The nia-datamover1 node will be offline this Monday afternoon for the Globus GCSv5 upgrade. Endpoint computecanada#niagara-GCSv4 will still be available via nia-datamover2.<br />
<br />
'''Tue Nov 28 16:29:14 EST 2023:''' The computecanada#hpss Globus endpoint is now running GCSv5. We'll find a window of opportunity next week to upgrade computecanada#niagara to GCSv5 as well.<br />
<br />
'''Tue Nov 28 14:20:30 EST 2023:''' The computecanada#hpss Globus endpoint will be offline for the next few hours for the GCSv5 upgrade.<br />
<br />
'''Fri Nov 10, 2023, 18:00 EDT:''' The HPSS upgrade is finished. We didn't have time to update Globus to GCSv5, so we'll find a window of opportunity to do this next week. <br />
<br />
Please be advised that starting this <B>Friday morning, Nov/10, we'll be upgrading the HPSS system from version 8.3 to 9.3 and the HPSS Globus server from GCSv4 to GCSv5.</B> If everything goes well, we expect to be back online by the end of the day. <br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Slurm&diff=4551
Slurm
2023-02-10T19:15:52Z
<p>Ejspence: /* jobperf */</p>
<hr />
<div>The queueing system used at SciNet is based around the [https://slurm.schedmd.com Slurm Workload Manager]. This "scheduler", Slurm, determines which jobs will be run on which compute nodes, and when. This page outlines how to submit jobs, how to interact with the scheduler, and some of the most common Slurm commands.<br />
<br />
Some common questions about the queuing system can be found on the [[FAQ]] as well.<br />
<br />
= Submitting jobs =<br />
<br />
You submit jobs from a Niagara login node. This is done by passing a script to the sbatch command:<br />
<br />
nia-login07:~$ sbatch jobscript.sh<br />
<br />
This puts the job, described by the job script, into the queue. The scheduler will run the job on the compute nodes in due course. A typical submission script is as follows.<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
<br />
mpirun ./mpi_example<br />
# or "srun ./mpi_example"<br />
</source><br />
<br />
Some notes about this example:<br />
* The first line indicates that this is a bash script.<br />
* Lines starting with <code>#SBATCH</code> go to SLURM.<br />
* sbatch reads these lines as a job request (which it gives the name <code>mpi_job</code>).<br />
* In this case, SLURM looks for 2 nodes with 40 cores on which to run 80 tasks, for 1 hour.<br />
* Note that the mpirun flag "--ppn" (processors per node) is ignored. Slurm takes care of this detail.<br />
* Once the scheduler finds a spot to run the job, it runs the script:<br />
** It changes to the submission directory;<br />
** Loads modules;<br />
** Runs the <code>mpi_example</code> application.<br />
* To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).<br />
<br />
To create a job script appropriate for your work, you must modify the commands above to instruct Slurm to run the commands you need run.<br />
<br />
== Things to remember ==<br />
<br />
There are some things to always bear in mind when crafting your submission script:<br />
* Scheduling is by node, so in multiples of 40 cores. You are expected to use all 40 cores! If you are running serial jobs, and need assistance bundling your work into multiples of 40, please see the [[Running_Serial_Jobs_on_Niagara | serial jobs]] page.<br />
* Jobs must write to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access. Download data you need before submitting your job.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below).<br />
* Jobs will run under your group's RRG allocation. If your group does not have an allocation, your job will run under your group's RAS allocation (previously called 'default' allocation). Note that groups with an allocation cannot run under a default allocation.<br />
* The maximum [[Wallclock_time | walltime]] for all users is 24 hours. The minimum and default walltime is 15 minutes.<br />
<br />
= Scheduling details =<br />
<br />
We now present the details of how to write a job script, and some extra commands which you might find useful.<br />
<br />
== SLURM nomenclature: jobs, nodes, tasks, cpus, cores, threads ==<br />
<br />
SLURM has a somewhat different way of referring to things like MPI processes and thread tasks, as compared to our previous scheduler, MOAB. The SLURM nomenclature is reflected in the names of scheduler options (i.e., resource requests). SLURM strictly enforces those requests, so it is important to get this right.<br />
<br />
{| class="wikitable"<br />
!term <br />
!meaning <br />
!SLURM term<br />
!related scheduler options <br />
|-<br />
|job<br />
|scheduled piece of work for which specific resources were requested.<br />
|job<br />
|<tt>sbatch, salloc</tt><br />
|-<br />
|node<br />
|basic computing component with several cores (40 for Niagara) that share memory <br />
|node<br />
|<tt>--nodes -N</tt><br />
|-<br />
|mpi process<br />
|one of a group of running programs using Message Passing Interface for parallel computing<br />
|task<br />
|<tt>--ntasks -n --ntasks-per-node</tt><br />
|-<br />
|core ''or'' physical cpu<br />
|A fully functional independent physical execution unit.<br />
| - <br />
| -<br />
|-<br />
|logical cpu<br />
|An execution unit that the operating system can assign work to. Operating systems can be configured to overload physical cores with multiple logical cpus using hyperthreading.<br />
|cpu<br />
|<tt>--cpus-per-task</tt><br />
|-<br />
|thread<br />
|one of possibly multiple simultaneous execution paths within a program, which can share memory.<br />
| -<br />
| <tt>--cpus-per-task</tt> '''and''' <tt>OMP_NUM_THREADS</tt><br />
|-<br />
|hyperthread<br />
|a thread run in a collection of threads that is larger than the number of physical cores.<br />
| -<br />
| -<br />
|}<br />
<br />
== Scheduling by Node ==<br />
<br />
* On many systems that use SLURM, the scheduler will deduce from the job script specifications (the number of tasks and the number of cpus per task) what resources should be allocated. On Niagara, this is a bit different.<br />
* All job resource requests on Niagara are scheduled as a multiple of '''nodes'''.<br />
* The nodes that your jobs run on are exclusively yours.<br />
** No other users are running anything on them.<br />
** You can ssh into them, while your job is running, to see how things are going.<br />
* Whatever you request of the scheduler, your request will always be translated into a multiple of nodes allocated to your job.<br />
* Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes. Each node has about 202GB of RAM available.<br />
* You should try to use all the cores on the nodes allocated to your job. Since there are 40 cores per node, your job should use N x 40 cores. If this is not the case, we will contact you to help you optimize your workflow. Again, users who have serial jobs should consult the [[Running Serial Jobs on Niagara | serial jobs]] page.<br />
<br />
== Hyperthreading: Logical CPUs vs. cores ==<br />
<br />
Hyperthreading, a technology that leverages more of the physical hardware by pretending there are twice as many logical cores as real cores, is enabled on Niagara.<br />
The operating system and scheduler therefore see 80 logical CPUs per node.<br />
<br />
Using 80 logical CPUs versus 40 real cores typically gives about a 5-10% speedup, depending on your application (your mileage may vary).<br />
<br />
Because Niagara is scheduled by node, hyperthreading is actually fairly easy to use:<br />
* Ask for a certain number of nodes, N, for your job.<br />
* You know that you get 40 x N cores, so you will use (at least) a total of 40 x N MPI processes or threads (mpirun, srun, and the OS will automatically spread these over the real cores).<br />
* But you should also test if running 80 x N MPI processes or threads gives you any speedup.<br />
* Regardless, your usage will be counted as 40 x N x (walltime in years).<br />
<br />
Many applications which are communication-heavy can benefit from the use of hyperthreading.<br />
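<br />
For example, a hyperthreaded variant of an MPI job script might look as follows (a minimal sketch, assuming OpenMPI and a hypothetical executable <code>./mpi_example</code>):<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=80   # 80 logical CPUs per node via hyperthreading<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name mpi_ht_job<br />
#SBATCH --output=mpi_ht_output_%j.txt<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
<br />
# --bind-to none is needed for OpenMPI (not IntelMPI) when using all 80 logical CPUs per node.<br />
mpirun --bind-to none ./mpi_example<br />
</source><br />
<br />
Compare the timing of such a run against the same job with <tt>--ntasks-per-node=40</tt> (and without <tt>--bind-to none</tt>) to see whether hyperthreading helps your application.<br />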
<br />
= Submission script details =<br />
<br />
This section outlines some details of how to interact with the scheduler, and how it implements Niagara's scheduling policies.<br />
<br />
== Queues ==<br />
<br />
There are 3 queues available on SciNet systems. These queues have different limits; see the [[#Limits | Limits]] section for further details.<br />
<br />
=== Compute ===<br />
<br />
The compute queue is the default queue. Most jobs will run in this queue. If no flags are specified in the submission script this is the queue where your job will land.<br />
<br />
=== Debug ===<br />
<br />
The Debug queue is a high-priority queue, used for short-term testing of your code. Do NOT use the debug queue for production work. You can use the debug queue in one of two ways. To submit a standard job script to the debug queue, add the line<br />
#SBATCH -p debug<br />
to your submission script. This will put the job into the debug queue, and it should run in short order.<br />
<br />
To request an interactive debug session, where you retain control over the command line prompt, at a login node type the command<br />
nia-login07:~$ salloc -p debug --nodes 1 --time=1:00:00<br />
This will request 1 node for 1 hour. You can similarly request a debug session using the 'debugjob' command:<br />
nia-login07:~$ debugjob N<br />
where N is the number of nodes. If N=1, this gives an interactive session of 1 hour; when N=4 (the maximum), it gives you 30 minutes.<br />
<br />
=== Archive ===<br />
<br />
The archivelong and archiveshort queues are only used by the [[HPSS]] system. See that page for details on how to use these queues.<br />
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ Resources for Research Group allocation] or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the <tt>-p</tt> parameter to <tt>sbatch</tt> or <tt>salloc</tt>, but if you do not specify one, your job will run in the <tt>compute</tt> partition, which is the most common case. <br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Running jobs<br />
!Jobs in queue<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 node (40&nbsp;cores) || default:&nbsp;20&nbsp;nodes&nbsp;(800&nbsp;cores) <br> with&nbsp;allocation:&nbsp;1000&nbsp;nodes&nbsp;(40000&nbsp;cores)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 node (40 cores) || 4 nodes (160 cores)|| N/A || min(1, 1.5/n<sub>node</sub>) hours<br />
|-<br />
|Archiving or retrieving data in [[HPSS]]|| archivelong || 2 per user (max 5 total) || 10 per user || N/A || N/A|| 15 minutes || 72 hours<br />
|-<br />
|Inspecting archived data, small archival actions in [[HPSS]] || archiveshort || 2 per user|| 10 per user || N/A || N/A || 15 minutes || 1 hour<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
== Slurm Accounts ==<br />
<br />
To be able to prioritise jobs based on groups and allocations, the Slurm scheduler uses the concept of ''accounts''. Each group that has a Resources for Research Groups (RRG) or Research Platforms and Portals (RPP) allocation (awarded through an annual competition by Compute Canada) has an account that starts with <tt>rrg-</tt> or <tt>rpp-</tt>. Slurm assigns a 'fairshare' priority to these accounts based on the size of the award in core-years. Groups without an RRG or RPP can use Niagara through the so-called Rapid Access Service (RAS), and have an account that starts with <tt>def-</tt>.<br />
<br />
On Niagara, most users will only ever use one account, and those users do not need to specify the account to Slurm. However, users that are part of collaborations may be able to use multiple accounts, i.e., that of their sponsor and that of their collaborator, but this means that they need to select the right account when running jobs. <br />
<br />
To select the account, just add <br />
<br />
#SBATCH -A [account]<br />
<br />
to the job scripts, or pass the <tt>-A [account]</tt> option to <tt>salloc</tt> or <tt>debugjob</tt>. <br />
<br />
To see which accounts you have access to, or what their names are, use the command<br />
<br />
sshare -U<br />
<br />
It has been noted that, in some cases, using the '-A' flag does not result in the appropriate account being used. To get around this, specify the account when sbatch is invoked:<br />
sbatch -A account myjobscript.sh<br />
<br />
== Slurm environment variables ==<br />
<br />
There are many environment variables built into Slurm. These are some which you may find useful:<br />
* SLURM_SUBMIT_DIR: directory from which the job was submitted.<br />
* SLURM_SUBMIT_HOST: host from which the job was submitted.<br />
* SLURM_JOB_ID: the job's id.<br />
* SLURM_JOB_NUM_NODES: number of nodes in the job.<br />
* SLURM_JOB_NODELIST: list of nodes assigned to the job.<br />
* SLURM_JOB_ACCOUNT: account associated with the job.<br />
<br />
Any of these environment variables can be accessed from within your job script.<br />
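<br />
For instance, a job script can use some of these variables to record where and how it ran (a minimal sketch; the log file name is hypothetical):<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=0:15:00<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# Record some job metadata in a small log file (hypothetical name) next to the job's output.<br />
log=job_info_${SLURM_JOB_ID}.txt<br />
echo "Job $SLURM_JOB_ID submitted from $SLURM_SUBMIT_HOST ($SLURM_SUBMIT_DIR)" > $log<br />
echo "Running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST" >> $log<br />
echo "Account: $SLURM_JOB_ACCOUNT" >> $log<br />
</source><br />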
<br />
== Passing Variables to submission scripts ==<br />
It is possible to pass values through environment variables into your SLURM submission scripts.<br />
To do so with variables already defined in your shell, just add the following directive to the submission script,<br />
<br />
#SBATCH --export=ALL<br />
<br />
and you will have access to any predefined environment variable.<br />
<br />
A better way is to specify explicitly which variables you want to pass into the submission script,<br />
<br />
sbatch --export=i=15,j='test' jobscript.sbatch<br />
<br />
You can even set the job name and output files using environment variables, e.g.<br />
<br />
i="simulation"<br />
j=14<br />
sbatch --job-name=$i.$j.run --output=$i.$j.out --export=i=$i,j=$j jobscript.sbatch<br />
<br />
(The latter only works on the command line; you cannot use environment variables in <tt>#SBATCH</tt> lines in the job script.)<br />
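<br />
Inside the job script, the exported variables are then available like any other shell variable. A minimal sketch of a hypothetical <tt>jobscript.sbatch</tt> matching the <tt>--export=i=15,j='test'</tt> example above:<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# i and j are expected to come in through sbatch's --export option.<br />
echo "Running case i=$i, label j=$j"<br />
./my_application --index "$i" --label "$j"   # hypothetical application and options<br />
</source><br />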
<br />
== Command line arguments ==<br />
<br />
Command line arguments can also be used by job scripts, in the same way as command line arguments for shell scripts. All command line arguments given to sbatch that follow the job script name will be passed to the job script. In fact, SLURM will not look at any of these arguments, so you must place all sbatch arguments before the script name, e.g.:<br />
<br />
sbatch -p debug jobscript.sbatch FirstArgument SecondArgument ...<br />
<br />
In this example, <tt>-p debug</tt> is interpreted by SLURM, while in your submission script you can access <tt>FirstArgument</tt>, <tt>SecondArgument</tt>, etc., by referring to <code>$1, $2, ...</code>.<br />
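<br />
A job script that uses such arguments might look as follows (a minimal sketch; the application and file names are hypothetical):<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# $1 and $2 are whatever followed the script name on the sbatch command line.<br />
input_file=$1<br />
output_file=$2<br />
./my_application "$input_file" "$output_file"   # hypothetical application<br />
</source><br />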
<br />
== Job arrays ==<br />
<br />
Sometimes you need to run the same job script many times, but just tweaking one value each time. One way of accomplishing this is using job arrays. Job arrays are invoked using the "-a" flag with sbatch:<br />
sbatch -a 1-100 myjobscript.sh<br />
This will submit 100 instances of myjobscript.sh. Within the job script you can distinguish which of those instances is running using the environment variable SLURM_ARRAY_TASK_ID.<br />
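<br />
A minimal job-array script could look like this (a sketch; the per-task input files are hypothetical):<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --output=array_output_%A_%a.txt<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# SLURM_ARRAY_TASK_ID takes the values given to "sbatch -a", e.g. 1 to 100.<br />
echo "This is array task $SLURM_ARRAY_TASK_ID"<br />
./my_application input_${SLURM_ARRAY_TASK_ID}.dat   # hypothetical per-task input file<br />
</source><br />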
<br />
Note that Niagara [[#Limits | currently]] has a limit of 1000 submitted jobs for users within groups with allocations, and 200 submitted jobs without an allocation.<br />
<br />
== Job dependencies ==<br />
<br />
You can make one job dependent on the successful completion of another job using the following command:<br />
sbatch --dependency=afterok:JOBID myjobscript.sh<br />
This will make the current job submission not start until the parent job, with jobid JOBID, successfully completes. There are many job dependency options available. Visit the [https://slurm.schedmd.com/sbatch.html#OPT_dependency Slurm sbatch page ] for the full list. <br />
<br />
If the parent job fails (that is, ends with a non-zero exit code) the dependent job can never be scheduled and will be automatically cancelled.<br />
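<br />
For example, a two-step pipeline can be submitted in one go by capturing the first job's id (a minimal sketch; the job script names are hypothetical):<br />
<br />
<source lang="bash"># Submit the first step and capture its job id (--parsable prints only the id).<br />
JOBID=$(sbatch --parsable first_step.sh)<br />
<br />
# Submit the second step; it will only start if the first step completes successfully.<br />
sbatch --dependency=afterok:$JOBID second_step.sh<br />
</source><br />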
<br />
== Email Notification ==<br />
Email notification works, but you need to add the email address and the type of notification you want to receive to your submission script, e.g.<br />
<br />
#SBATCH --mail-user=YOUR.email.ADDRESS<br />
#SBATCH --mail-type=ALL<br />
<br />
The sbatch man page (type <tt>man sbatch</tt> on Niagara) explains all possible mail-types.<br />
<br />
== Job Location Constraints ==<br />
<br />
=== Node types ===<br />
<br />
With the expansion of Niagara there are now two node types: 1548 Intel 6148 "skylake" CPU-based nodes, and 468 Intel 6248 "cascadelake" CPU-based nodes. By default a job will be placed on the first available nodes but will not span node types. You can specify a node type by adding one of the following directives to your submission script.<br />
<br />
#SBATCH --constraint=skylake <br />
#SBATCH --constraint=cascade<br />
<br />
=== EDR/HDR Infiniband Topology ===<br />
<br />
The Infiniband high-speed network used for job communication and file I/O on Niagara consists of 5 1:1 subscribed "wings" that are connected together in a dragonfly topology with adaptive routing enabled. 4 wings (dragonfly[1-4]) consist of EDR-based skylake nodes, and dragonfly5 contains all of the HDR100 cascadelake nodes. By default multi-node jobs will run on the first available nodes, which could be all within 1 wing, or span across multiple wings, but not across node types. For most scalable parallel programs the performance difference should not be very significant; however, if you wish to keep your jobs from spanning wings, you can use the following.<br />
<br />
#SBATCH --constraint=[dragonfly1|dragonfly2|dragonfly3|dragonfly4|dragonfly5]<br />
<br />
= Monitoring jobs =<br />
<br />
There are many options available for monitoring your jobs, the most basic of which is the squeue command:<br />
<br />
nia-login07:~$ squeue -u USERNAME<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
292047 compute myjob4 username PD 0:00 4 (Priority)<br />
292048 compute myjob3 username PD 0:00 4 (Priority)<br />
266829 compute myjob2 username R 18:56:17 2 nia[1397-1398]<br />
266828 compute myjob1 username R 18:56:46 1 nia1298<br />
<br />
Here you can see that we have two running jobs ('R') and two pending jobs ('PD'). The nodes being used are listed.<br />
<br />
== Job status ==<br />
<br />
To get an estimate of when a job will start, use the command<br />
squeue --start -j JOBID<br />
Note that this is only an estimate, and tends not to be very accurate.<br />
<br />
Information about a specific job can be found using the command<br />
squeue -j JOBID<br />
or alternatively<br />
scontrol show job JOBID<br />
which is more verbose.<br />
<br />
== SSHing to a node ==<br />
<br />
Once your job has started, the node belongs to you. As such you may, from a login node, SSH into the node to check the performance of your job. The first step is to find out which nodes are being used (see above). Once you have your list of nodes, you can SSH into them directly. Once there, you can run the 'top' or 'free' commands to check both CPU and memory usage.<br />
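<br />
For example (a sketch, with a hypothetical node name taken from the squeue output):<br />
<br />
 nia-login07:~$ squeue -u USERNAME     # find the nodes your job is using, e.g. nia1397<br />
 nia-login07:~$ ssh nia1397            # log in to one of those nodes<br />
 nia1397:~$ top                        # per-process CPU and memory usage<br />
 nia1397:~$ free -g                    # overall memory usage, in GB<br />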
<br />
== jobperf ==<br />
<br />
The jobperf script will give you feedback on the performance of your currently-running job:<br />
nia-login07:~$ jobperf 123456<br />
----------------------------------------------------------------------------------------------------<br />
RUNNING IDLE USER MEMORY(MB) PROCESS NAMES<br />
HOSTNAME # %CPU %MEM DISK SLEEP NAME RAMDISK USED AVAIL (excl:bash,sh,ssh,sshd)<br />
----------------------------------------------------------------------------------------------------<br />
nia1013 71 174% 0.5% 0 22 ejspence 0 15060 178017 14*gmx_mpi mpiexec slurm_script<br />
nia1014 79 192% 0.1% 0 18 ejspence 0 14803 178274 13*gmx_mpi<br />
nia1295 79 188% 0.4% 0 18 ejspence 0 15199 177878 13*gmx_mpi<br />
----------------------------------------------------------------------------------------------------<br />
<br />
Here you can see both the CPU and memory usage of the job, for all nodes being used.<br />
<br />
== Other commands ==<br />
<br />
Some other commands that can be useful for dealing with your jobs:<br />
* <code>scancel -i JOBID</code> cancels a specific job.<br />
* <code>sacct</code> gives information about your recent jobs.<br />
* <code>sinfo -p compute</code> gives a list of available nodes.<br />
* <code>qsum</code> gives a summary of the queue by user.<br />
<br />
= Example submission scripts =<br />
<br />
Here we present some examples of how to create submission scripts for running parallel jobs. Serial job examples can be found on the [[Running_Serial_Jobs_on_Niagara | serial jobs page]].<br />
<br />
== Example submission script (MPI) ==<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=8<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
<br />
mpirun ./mpi_example<br />
# or "srun ./mpi_example"<br />
</source><br />
Submit this script with the command:<br />
<br />
nia-login07:~$ sbatch mpi_job.sh<br />
<br />
<ul><br />
<li><p>First line indicates that this is a bash script.</p></li><br />
<li><p>Lines starting with <code>#SBATCH</code> go to SLURM.</p></li><br />
<li><p>sbatch reads these lines as a job request (which it gives the name <code>mpi_job</code>)</p></li><br />
<li><p>In this case, SLURM looks for 8 nodes with 40 cores on which to run 320 tasks, for 1 hour.</p></li><br />
<li><p>Note that the mpirun flag "--ppn" (processors per node) is ignored.</p></li><br />
<li><p>Once it finds such nodes, it runs the script:</p><br />
<ul><br />
<li>Changes to the submission directory;</li><br />
<li>Loads modules;</li><br />
<li>Runs the <code>mpi_example</code> application.</li><br />
</ul><br />
<li>To use hyperthreading, just change --ntasks-per-node=40 to --ntasks-per-node=80, and add --bind-to none to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI).</li><br />
</ul><br />
<br />
== Example submission script (OpenMP) ==<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --cpus-per-task=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name openmp_job<br />
#SBATCH --output=openmp_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2<br />
<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
./openmp_example<br />
# or "srun ./openmp_example".<br />
</source><br />
Submit this script with the command:<br />
<br />
nia-login07:~$ sbatch openmp_job.sh<br />
<br />
* First line indicates that this is a bash script.<br />
* Lines starting with <code>#SBATCH</code> go to SLURM.<br />
* sbatch reads these lines as a job request (which it gives the name <code>openmp_job</code>) .<br />
* In this case, SLURM looks for one node on which to run one task with 40 cores, for 1 hour.<br />
* Once it finds such a node, it runs the script:<br />
** Changes to the submission directory;<br />
** Loads modules;<br />
** Sets an environment variable;<br />
** Runs the <code>openmp_example</code> application.<br />
* To use hyperthreading, just change <code>--cpus-per-task=40</code> to <code>--cpus-per-task=80</code>.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Singularity&diff=4404
Singularity
2022-12-16T16:06:59Z
<p>Ejspence: </p>
<hr />
<div>Please see our [[Docker]] page.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4173
Main Page
2022-09-16T14:01:52Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Fri Sept 16, 2022, 9:30 AM EDT:''' Login nodes are accessible again.<br />
<br />
'''Fri Sept 16, 2022, 9:00 AM EDT:''' Login nodes are not accessible. We are investigating.<br />
<br />
'''Tue Sep 13, 2022, 11:00 AM EDT:''' Mist login node is available again.<br />
<br />
'''Tue Sep 13, 2022, 10:00 AM EDT:''' Mist login node is under maintenance and temporarily inaccessible to users.<br />
<br />
'''Fri Sep 2, 2022, 11:25 AM EDT:''' Rouge login node is back up.<br />
<br />
'''Fri Sep 2, 2022, 10:25 AM EDT:''' Issues with the Rouge login node; we are investigating.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Previous_messages&diff=4170
Previous messages
2022-09-16T14:01:44Z
<p>Ejspence: </p>
<hr />
<div>'''Tue Aug 23, 2022, 1:15 PM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Tue Aug 23, 2022, 1:00 PM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Fri Aug 12, 2022, 6:30 PM EDT:''' File system issues are resolved.<br />
<br />
'''Fri Aug 12, 2022, 5:06 PM EDT:''' File system issues. We are investigating.<br />
<br />
'''Thu Aug 11, 2022, 9:20 AM EDT:''' The login node issues have been resolved.<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please login to Mist and then ssh to a Niagara login node to access Niagara ("ssh nia-login02", for example).<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Thu June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Thu June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
'''Mon June 13, 2022, 7:00 AM EDT - Wed June 15, 2022, 7:00 AM EDT:''' Two-day reservation for the "Niagara at Scale" event. Only "Niagara at Scale" projects will run on the compute nodes (as well as SOSCIP projects, on a subset of nodes). Users are encouraged to submit small and short jobs that could run before this event. Throughout the event, users can still login, access their data, and submit jobs, but these jobs will not run until after the subsequent maintenance (see below). Note that the debugjob queue will remain available to everyone as well.<br />
<br />
'''Mon May 30th, 2022, 12:42:00 EDT:''' Mist login node is available again.<br />
<br />
'''Mon May 30th, 2022, 10:22:00 EDT:''' Mist login node is being upgraded and temporarily inaccessible to users.<br />
<br />
'''Wed May 25th, 2022, 13:30:00 EDT:''' Niagara operating at 100% again.<br />
<br />
'''Tue May 24th, 2022, 21:30:00 EDT:''' Jupyter Hub up. Part of Niagara can run compute jobs.<br />
<br />
'''Tue May 24th, 2022, 19:00:00 EDT:''' Systems are up. Users can login, BUT cannot submit jobs yet.<br />
<br />
'''Tue May 24th, 2022, 10:00:00 EDT:''' We are still performing system checks.<br />
<br />
'''Mon May 23rd, 2022, 16:44:30 EDT:''' Systems still down. Filesystems are working, but there are quite a number of drive failures - no data loss - so out of an abundance of caution we are keeping the systems down at least until tomorrow. The long weekend has also been disruptive for service response, and we prefer to err on the safe side.<br />
<br />
'''Mon May 23rd, 2022, 08:12:14 EDT:''' Systems still down. Filesystems being checked to ensure no heat damage.<br />
<br />
'''Sun May 22nd, 2022, 10.16 am EDT:''' Electrician dispatched to replace blown fuses.<br />
<br />
'''Sun May 22nd, 2022, 2:54 am EDT:''' Automatic shutdown down due to power/cooling.<br />
<br />
'''Fri May 6th, 2022, 11:35 am EDT:''' HPSS scheduler upgrade also finished.<br />
<br />
'''Thu May 5th, 2022, 7:45 pm EDT:''' Upgrade of the scheduler has finished, with the exception of HPSS.<br />
<br />
'''Thu May 5th, 2022, 7:00 am - 3:00 pm EDT (approx):''' Starting from 7:00 am EDT, an upgrade of the scheduler of the Niagara, Mist, and Rouge clusters will be applied. This requires the scheduler to be down for about 5-6 hours, and all compute and login nodes to be rebooted.<br />
Jobs cannot be submitted during this maintenance, but jobs submitted beforehand will remain in the queue. For most of the time, the login nodes of the clusters will be available so that users may access their files on the home, scratch, and project file systems.<br />
<br />
'''Monday May 2nd, 2022, 9:30 - 11:00 am EDT:''' the Niagara login nodes, the jupyter hub, and nia-datamover2 will get rebooted for updates. In the process, any login sessions will get disconnected, and servers on the jupyterhub will stop. Jobs in the Niagara queue will not be affected.<br />
<br />
'''Tue Apr 26, 11:20 AM EDT:''' A Rolling update of the Mist cluster is taking a bit longer than expected, affecting logins to Mist. <br />
<br />
'''Announcement:''' On Thursday April 14th, 2022, the connectivity to the SciNet datacentre will be disrupted at 11:00 AM EDT for a few minutes, in order to deploy a new network core switch. Any SSH connections or data transfers to SciNet systems (Niagara, Mist, etc.) may be terminated at that time.<br />
<br />
'''Thu March 24, 6:54 AM EST:''' HPSS is back online<br />
<br />
'''Thu March 24, 8:15 AM EST:''' HPSS has a hardware problem<br />
<br />
'''Wed March 2, 4:50 PM EST:''' The CCEnv software stack is available again on Niagara.<br />
<br />
'''Wed March 2, 7:50 AM EST:''' The CCEnv software stack on Niagara has issues; we are investigating.<br />
<br />
'''Sat Feb 12 2022, 12:59 EST:''' Jupyterhub is back up, but may have a hardware issue.<br />
<br />
'''Sat Feb 12 2022, 10:36 EST:''' Issue with the Jupyterhub, since last night. We're investigating.<br />
<br />
'''Tue Feb 1 2022 19:20 EST:''' Maintenance finished successfully. Systems are up. <br />
<br />
'''Tue Feb 1 2022 13:00 EST:''' Maintenance downtime started.<br />
<br />
'''Mon Jan 31 2022 13:15:00 EST:''' The SciNet datacentre's cooling system needs an '''emergency repair''' as soon as possible. During this repair, all systems hosted at SciNet (Niagara, Mist, Rouge, HPSS, and Teach) will need to be switched off and will be unavailable to users. Repairs will start '''Tuesday February 1st, at 1:00 pm EST''', and could take until the end of the next day. Please check here for updates.<br />
<br />
'''Sat Jan 29 2022 16:45:38 EST:''' Fibre repaired.<br />
<br />
'''Sat 29 Jan 2022 11:22:27 EST:''' Fibre repair is underway. Expect to have connectivity restored later today.<br />
<br />
'''Fri 28 Jan 2022 07:35:01 EST:''' The fibre optics cable that connects the SciNet datacentre was severed by uncoordinated digging at York University. We expect repairs to happen as soon as possible.<br />
<br />
'''Thu Jan 27 12:46 EST PM 2022:''' Network issues to and from the datacentre. We are investigating.<br />
<br />
'''Sun Jan 23 11:05 EST AM 2022:''' Filesystem issues appear to have resolved.<br />
<br />
'''Sun Jan 23 10:30 EST AM 2022:''' Filesystem issues -- investigating.<br />
<br />
'''Sat Jan 8 11:42 EST AM 2022:''' The emergency maintenance is complete. Systems are up and available.<br />
<br />
'''Fri Jan 7 14:34 EST PM 2022:''' The SciNet shutdown is in progress. Systems are expected back on Saturday, Jan 8.<br />
<br />
'''<span style="color:red">Emergency shutdown Friday January 7, 2022</span>''': An emergency shutdown of all SciNet to replace a crucial file system component is planned to take place on Friday January 7, 2022, starting at 8am EST, and will require at least 12 hours of downtime. Updates will be posted during the day.<br />
<br />
'''Thu Jan 6 08:20 EST AM 2022''' The SciNet filesystem is having issues. We are investigating.<br />
<br />
'''Fri Dec 24 13:31 EST PM 2021''' Please note the following scheduled network maintenance, which will result in loss of connectivity to the SciNet datacentre: start time Dec 29, 00:30 EST; estimated duration 4 hours and 30 minutes. <br />
<br />
'''Mon Dec 20 4:29 EST PM 2021''' Filesystem is back to normal. <br />
<br />
'''Mon Dec 20 2:53 EST PM 2021''' Filesystem problem - We are investigating. <br />
<br />
'''Thu Sep 23 12:30 EDT 2021 ''' Cooling restored. Systems should be available later this afternoon. <br />
<br />
'''Thu Sep 23 9:30 EDT 2021 ''' Technicians on site working on cooling system. <br />
<br />
'''Thu Sep 23 3:30 EDT 2021 ''' Cooling system issues still unresolved. <br />
<br />
'''Wed Sep 22 23:27:48 EDT 2021 ''' Shutdown of the datacenter due to a problem with the cooling system.<br />
<br />
'''Wed Sep 22 09:30 EDT 2021 ''': File system issues, resolved.<br />
<br />
'''Wed Sep 22 07:30 EDT 2021 ''': File system issues, investigating.<br />
<br />
'''Sun Sep 19 10:00 EDT 2021''': Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
'''Wed Sep 15 17:35 EDT 2021''': filesystem issues resolved<br />
<br />
'''Wed Sep 15 16:39 EDT 2021''': filesystem issues<br />
<br />
'''Mon Sep 13 13:15:07 EDT 2021''' HPSS is back online.<br />
<br />
'''Fri Sep 10 17:57:23 EDT 2021''' HPSS is offline due to unscheduled maintenance.<br />
<br />
'''Wed Aug 18 16:13:42 EDT 2021''' The HPSS upgrade is complete.<br />
<br />
'''HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):''' We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
'''July 24, 2021, 6:00 PM EDT:''' There appear to be file system issues, which may affect users' ability to login. We are investigating.<br />
<br />
''' July 23rd, 2021, 9:00 AM EDT:''' ''' Security update: ''' Due to a severe vulnerability in the Linux kernel (CVE-2021-33909), our team is currently patching and rebooting all login nodes and compute nodes, as well as the JupyterHub. There should be no effect on running jobs; however, sessions on login and datamover nodes will be disrupted. <br />
<br />
''' July 20th, 2021, 7:00 PM EDT:''' ''' SLURM configuration''' - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
''' July 20th, 2021, 7:00 PM EDT:''' Maintenance finished, systems are back online. <br />
<br />
'''SciNet Downtime July 20th, 2021 (Tuesday):''' There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
'''June 29th, 2021, 2:00 PM:''' Thunderstorm-related power fluctuations are causing some Niagara compute nodes and their jobs to crash. Please resubmit if your jobs seem to have crashed for no apparent reason.<br />
<br />
'''June 28th, 2021, 4:06 PM:''' Mist OS upgrade is complete.<br />
<br />
'''June 28th, 2021, 9:00 AM:''' Mist is under maintenance. OS upgrading from RHEL 7 to 8.<br />
<br />
'''June 11th, 2021, 8:30 AM:''' Maintenance complete. Systems are up.<br />
<br />
'''June 9th to 10th, 2021:''' The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, Rouge, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown starting at 7AM EDT on Wednesday June 9th. We expect the systems to be back up in the morning of Friday June 11th. Check here for updates.<br />
<br />
'''May 27, 2021:''' Datamovers addresses have changed to improve high bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
'''May 27th, 20:00.''' All systems are up and running <br />
<br />
'''May 27th, 19:30.''' Most systems are up<br />
<br />
'''May 27th, 19:00:''' Cooling is back. Powering up systems<br />
<br />
'''May 27th, 2021, 11:30am:''' The cooling tower issue has been identified as a wiring issue and is being repaired. We don't have an ETA on when cooling will be restored, however we are hopeful it will be by the end of the day. <br />
<br />
'''May 27th, 2021, 12:30am:''' Cooling tower motor is not working properly and may need to be replaced. It's the primary motor and the cooling system cannot run without it, so at least until tomorrow all equipment at the datacenter will remain unavailable. Updates about expected repair times will be posted when they are known.<br />
<br />
'''May 26th, 2021, 9:20pm:''' we are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
'''From Tue Mar 30 at 12 noon EST to Thu Apr 1 at 12 noon EST,''' there will be a two-day reservation for the "Niagara at Scale" pilot event. During these 48 hours, only "Niagara at Scale" projects will run on the compute nodes (as well as SOSCIP projects, on a subset of nodes). All other users can still login, access their data, and submit jobs throughout this event, but the jobs will not run until after the event. The debugjob queue will remain available to everyone as well.<br />
<br />
The scheduler will not start batch jobs that cannot finish before the start of this event. Users who submit small and short jobs can take advantage of this, as the scheduler may be able to fit these jobs in on the otherwise idle nodes before the event starts.<br />
<br />
'''Tue 23 Mar 2021 12:19:07 PM EDT''' - Planned external network maintenance 12pm-1pm Tuesday, March 23rd. <br />
<br />
'''Thu Jan 28 17:35:16 EST 2021:''' HPSS services are back online<br />
<br />
'''Thu Jan 28 12:36:21 EST 2021:''' HPSS services offline<br />
<br />
We need a small maintenance window as early as possible still this afternoon to perform a small change in configuration. Ongoing jobs will be allowed to finish, but we are keeping new submissions on hold on the queue.<br />
<br />
'''Mon Jan 25 13:16:33 EST 2021:''' HPSS services are back online<br />
<br />
'''Sat Jan 23 10:03:33 EST 2021:''' HPSS services offline<br />
<br />
We detected some type of hardware failure on our HPSS equipment overnight, so access has been disabled pending further investigation.<br />
<br />
'''Fri Jan 22 10:49:29 EST 2021:''' The Globus transition to oauth is finished<br />
<br />
Please deactivate any previous sessions to the niagara endpoint (in the last 7 days), and activate/login again. <br />
<br />
For more details check https://docs.scinet.utoronto.ca/index.php/Globus#computecandada.23niagara<br />
<br />
'''Jan 21, 2021:''' Globus access disruption on Fri, Jan/22/2021 10AM: Please be advised that we will have a maintenance window starting tomorrow at 10AM to roll out the transition of services to oauth based authentication.<br />
<br />
'''Jan 15, 2021:''' Globus access update on Mon, Jan/18/2021 and Tue, Jan/19/2021:<br />
Please be advised we will start preparations on Monday to perform an update to Globus access on Tuesday. We'll be adopting oauth instead of myproxy from that point on. During this period expect sporadic disruptions of service. On Monday we will already block access to nia-dm2, so please refrain from starting new login sessions or ssh tunnels via nia-dm2 starting this weekend.<br />
<br />
''' December 11, 2020, 12:00 AM EST: ''' Cooling issue resolved. Systems back.<br />
<br />
''' December 11, 2020, 6:00 PM EST: ''' Cooling issue at datacenter. All systems down.<br />
<br />
''' December 7, 2020, 7:25 PM EST: '''All systems back; users can log in again.<br />
<br />
''' December 7, 2020, 6:46 PM EST: '''User connectivity to data center not yet ready, but queued jobs on Mist and Niagara have been started.<br />
<br />
''' December 7, 2020, 7:00 AM EST: '''Maintenance shutdown in effect. This is a one-day maintenance shutdown. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online this evening.<br />
<br />
''' December 2, 2020, 9:10 PM EST: '''Power is back, systems are coming up. Please resubmit any jobs that failed because of this incident.<br />
<br />
''' December 2, 2020, 6:00 PM EST: '''Power glitch at the data center, caused about half of the compute nodes to go down. Power issue not yet resolved.<br />
<br />
'''<span style="color:#dd1111">Announcing a Maintenance Shutdown on December 7th, 2020</span>''' <br/>There will be a one-day maintenance shutdown on December 7th 2020, starting at 7 am EST. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online in the evening of the same day.<br />
<br />
''' November 6, 2020, 8:00 PM EST: ''' Systems are coming back online.<br />
<br />
''' November 6, 2020, 9:49 AM EST: ''' Repairs on the cooling system are underway. No ETA, but the systems will likely be back some time today.<br />
<br />
''' November 6, 2020, 4:27 AM EST: '''Cooling system failure, datacentre is shut down.<br />
<br />
''' October 9, 2020, 12:57 PM: ''' A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
''' October 8, 2020, 9:50 PM: ''' Jupyterhub service is back up.<br />
<br />
''' October 8, 2020, 5:40 PM: ''' Jupyterhub service is down. We are investigating.<br />
<br />
''' September 28, 2020, 11:00 AM EST: ''' A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
''' September 1, 2020, 2:15 PM EST: ''' A short power glitch caused about half of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
''' September 1, 2020, 9:27 AM EST: ''' The Niagara cluster has moved to a new default software stack, NiaEnv/2019b. If your job scripts used the previous default software stack (NiaEnv/2018a), please put the command "module load NiaEnv/2018a" before other module commands in those scripts to ensure they continue to work, or try the new stack (recommended).<br />
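For example, a minimal sketch of a job script pinned to the old stack (the module loaded after NiaEnv/2018a is only a placeholder; use whatever modules your script already loads) would start like this:<br />
<source lang="bash"><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --time=1:00:00<br />
<br />
# Pin the old default stack first so existing module commands keep working<br />
module load NiaEnv/2018a<br />
# ...then the modules the script already loaded, for example (placeholder):<br />
module load gcc/7.3.0<br />
</source><br />
<br />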
''' August 24, 2020, 7:37 PM EST: ''' Connectivity is back to normal<br />
<br />
''' August 24, 2020, 6:35 PM EST: ''' We have partial connectivity back, but are still investigating.<br />
<br />
''' August 24, 2020, 3:15 PM EST: ''' There are issues connecting to the data centre. We're investigating.<br />
<br />
''' August 21, 2020, 6:00 PM EST: ''' The pump has been repaired, cooling is restored, systems are up. <br/>Scratch purging is postponed until the evening of Friday Aug 28th, 2020.<br />
<br />
'''August 19, 2020, 4:40 PM EST:''' Update: The current estimate is to have the cooling restored on Friday and we hope to have the systems available for users on Saturday August 22, 2020.<br />
<br />
'''August 17, 2020, 4:00 PM EST:''' Unfortunately, after taking the pump apart, it was determined that there was a more serious failure of the main drive shaft, not just the seal. As a new one will need to be sourced or fabricated, we estimate that it will take at least a few more days to get the part and complete the repairs to restore cooling. Sorry for the inconvenience. <br />
<br />
'''August 15, 2020, 1:00 PM EST:''' Due to parts availability for repairing the failed pump and cooling system, it is unlikely that the systems can be restored before Monday afternoon at the earliest. <br />
<br />
'''August 15, 2020, 12:04 AM EST:''' A primary pump seal in the cooling infrastructure has blown, and parts availability cannot be determined until tomorrow. All systems are shut down as there is no cooling. If parts are available, systems may be back late tomorrow at the earliest. Check here for updates. <br />
<br />
'''August 14, 2020, 9:04 PM EST:''' Tomorrow's /scratch purge has been postponed.<br />
<br />
'''August 14, 2020, 9:00 PM EST:''' Staff are at the datacenter. It looks like one of the pumps has a badly leaking seal.<br />
<br />
'''August 14, 2020, 8:37 PM EST:''' We seem to be undergoing a thermal shutdown at the datacenter.<br />
<br />
'''August 14, 2020, 8:20 PM EST:''' Network problems to niagara/mist. We are investigating.<br />
<br />
'''August 13, 2020, 10:40 AM EST:''' Network is fixed, scheduler and other services are back.<br />
<br />
'''August 13, 2020, 8:20 AM EST:''' We had an IB switch failure, which is affecting a subset of nodes, including the scheduler nodes.<br />
<br />
'''August 10, 2020, 7:30 PM EST:''' Scheduler fully operational again.<br />
<br />
'''August 10, 2020, 3:00 PM EST:''' Scheduler partially functional: jobs can be submitted and are running.<br />
<br />
'''August 10, 2020, 2:00 PM EST:''' Scheduler is temporarily not operational.<br />
<br />
'''August 7, 2020, 9:15 PM EST:''' Network is fixed, scheduler and other services are coming back.<br />
<br />
'''August 7, 2020, 8:20 PM EST:''' Disruption of part of the network in the data centre. This is causing issues with the scheduler, the Mist login node, and possibly other services. We are investigating.<br />
<br />
'''July 30, 2020, 9:00 AM''' Project backup in progress but incomplete: please be aware that after we deployed the new, larger storage appliance for scratch and project two months ago, we started a full backup of project (1.5 PB). This backup is taking a while to complete, and there are still a few areas which have not been fully backed up. Please be careful not to delete things from project that you still need, in particular recently added material.<br />
<br />
'''July 27, 2020, 5:00 PM:''' Scheduler issues resolved.<br />
<br />
'''July 27, 2020, 3:00 PM:''' Scheduler issues. We are investigating.<br />
<br />
'''July 13, 4:40 PM:''' Most systems are available again. Only Mist is still being brought up.<br />
<br />
'''July 13, 10:00 AM:''' '''SciNet/Niagara Downtime In Progress'''<br />
<br />
'''SciNet/Niagara Downtime Announcement, July 13, 2020'''<br/><br />
All resources at SciNet will undergo a maintenance shutdown on Monday July 13, 2020, starting at 10:00 am EDT, for file system and scheduler upgrades. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time.<br />
We expect to be able to bring the systems back around 3 PM (EST) on the same day.<br />
<br />
''' June 29, 6:21:00 PM:''' Systems are available again. <br />
<br />
''' June 29, 12:30:00 PM:''' Power Outage caused thermal shutdown.<br />
<br />
'''June 20, 2020, 10:24 PM:''' File systems are back up. Unfortunately, all running jobs would have died and users are asked to resubmit them.<br />
<br />
'''June 20, 2020, 9:48 PM:''' An issue with the file systems is causing trouble. We are investigating the cause.<br />
<br />
'''June 15, 2020, 10:30 PM:''' A '''power glitch''' caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''June 12, 2020, 6:15 PM:''' Two '''power glitches''' during the night caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''June 6, 2020, 6:06 AM:''' A '''power glitch''' caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''May 24, 2020, 8:20 AM:''' A '''power glitch''' this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''May 7, 2020, 6:05 PM:''' Maintenance shutdown is finished. Most systems are back in production.<br />
<br />
'''May 6, 2020, 7:08 AM:''' Two-day datacentre maintenance shutdown has started.<br />
<br />
''' SciNet/Niagara Downtime Announcement, May 6-7, 2020'''<br />
<br />
All resources at SciNet will undergo a two-day maintenance shutdown on May 6th and 7th 2020, starting at 7 am EDT on Wednesday May 6th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) or systems hosted at the SciNet data centre. We expect to be able to bring the systems back online the evening of May 7th.<br />
<br />
'''May 4, 2020, 7:51 AM:''' A power glitch this morning caused compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''May 3, 2020, 8:20 AM:''' A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
'''April 28, 2020, 7:20 AM:''' A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time have failed; users are asked to resubmit these jobs.<br />
<br />
'''April 20, 2020: Security Incident at Cedar; implications for Niagara users'''<br />
<br />
Last week, it became evident that the Cedar GP cluster had been<br />
compromised for several weeks. The passwords of at least two<br />
Compute Canada users were known to the attackers. One of these was<br />
used to escalate privileges on Cedar, as explained on<br />
https://status.computecanada.ca/view_incident?incident=423.<br />
<br />
These accounts were used to login to Niagara as well, but Niagara<br />
did not have the same security loophole as Cedar (which has been<br />
fixed), and no further escalation was observed on Niagara.<br />
<br />
Reassuring as that may sound, it is not known how the passwords of<br />
the two user accounts were obtained. Given this uncertainty, the<br />
SciNet team *strongly* recommends that you change your password on<br />
https://ccdb.computecanada.ca/security/change_password, and remove<br />
any SSH keys and regenerate new ones (see<br />
https://docs.scinet.utoronto.ca/index.php/SSH_keys).<br />
<br />
''' Tue 30 Mar 2020 14:55:14 EDT''' Burst Buffer available again.<br />
<br />
''' Fri Mar 27 15:29:00 EDT 2020:''' SciNet systems are back up. Only the Burst Buffer remains offline, its maintenance is expected to be finished early next week.<br />
<br />
''' Thu Mar 26 23:05:00 EDT 2020:''' Some aspects of the maintenance took longer than expected. The systems will not be back up until some time tomorrow, Friday March 27, 2020. <br />
<br />
''' Wed Mar 25 7:00:00 EDT 2020:''' SciNet/Niagara downtime started.<br />
<br />
''' Mon Mar 23 18:45:10 EDT 2020:''' File system issues were resolved.<br />
<br />
''' Mon Mar 23 18:01:19 EDT 2020:''' There is currently an issue with the main Niagara filesystems. This effects all systems, all jobs have been killed. The issue is being investigated. <br />
<br />
''' Fri Mar 20 13:15:33 EDT 2020: ''' There was a power glitch at the datacentre at 8:50 AM, which resulted in jobs getting killed. Please resubmit failed jobs. <br />
<br />
''' COVID-19 Impact on SciNet Operations, March 18, 2020'''<br />
<br />
Although the University of Toronto is closing some of its<br />
research operations on Friday March 20 at 5 pm EDT, this does not<br />
affect the SciNet systems (such as Niagara, Mist, and HPSS), which<br />
will remain operational.<br />
<br />
''' SciNet/Niagara Downtime Announcement, March 25-26, 2020'''<br />
<br />
All resources at SciNet will undergo a two-day maintenance shutdown on March 25th and 26th 2020, starting at 7 am EDT on Wednesday March 25th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
This shutdown is necessary to finish the expansion of the Niagara cluster and its storage system.<br />
<br />
We expect to be able to bring the systems back online the evening of March 26th.<br />
<br />
''' March 9, 2020, 11:24 PM:''' HPSS services are temporarily suspended for emergency maintenance.<br />
<br />
''' March 7, 2020, 10:15 PM:''' File system issues have been cleared.<br />
<br />
''' March 6, 2020, 7:30 PM:''' File system issues; we are investigating<br />
<br />
''' March 2, 2020, 1:30 PM:''' For the extension of Niagara, the operating system on all Niagara nodes is being upgraded<br />
from CentOS 7.4 to 7.6. This requires all<br />
nodes to be rebooted. Running compute jobs are allowed to finish<br />
before each compute node gets rebooted. The login nodes have all been rebooted, as have the datamover nodes and the jupyterhub service.<br />
<br />
''' Feb 24, 2020, 1:30PM: ''' The [[Mist]] login node got rebooted. It is back, but we are still monitoring the situation.<br />
<br />
''' Feb 12, 2020, 11:00AM: ''' The [[Mist]] GPU cluster now available to users.<br />
<br />
''' Feb 11, 2020, 2:00PM: ''' The Niagara compute nodes were accidentally rebooted, killing all running jobs.<br />
<br />
''' Feb 10, 2020, 7:00 PM: ''' HPSS is back to normal.<br />
<br />
''' Jan 30, 2020, 12:01PM: ''' We are having an issue with HPSS, in which the disk-cache is full. We put a reservation on the whole system (Globus, plus archive and vfs queues), until it has had a chance to clear some space on the cache.<br />
<br />
''' Jan 21, 2020, 4:05PM: ''' There was a partial power outage that took down a large number of the compute nodes. If your job died during this period, please resubmit it. <br />
<br />
'''Jan 13, 2020, 7:35 PM:''' Maintenance finished.<br />
<br />
'''Jan 13, 2020, 8:20 AM:''' The announced maintenance downtime started (see below).<br />
<br />
'''Jan 9 2020, 11:30 AM:''' External ssh connectivity restored, issue related to the university network.<br />
<br />
'''Jan 9 2020, 9:24 AM:''' We received reports of users having trouble connecting into the SciNet data centre; we're investigating. Systems are up and running and jobs are fine.<p><br />
As a workaround, in the meantime it appears to be possible to log into graham, cedar or beluga, and then ssh to niagara.</p><br />
<br />
'''Downtime announcement:'''<br />
To prepare for the upcoming expansion of Niagara, there will be a<br />
one-day maintenance shutdown on '''January 13th 2020, starting at 8 am<br />
EST'''. There will be no access to Niagara, Mist, HPSS or teach, nor<br />
to their file systems during this time.<br />
<br />
2019<br />
<br />
'''December 13, 9:00 AM EST:''' Issues resolved.<br />
<br />
'''December 13, 8:20 AM EST:''' Overnight issue is now preventing logins to Niagara and other services. Possibly a file system issue, we are investigating.<br />
<br />
<p> '''Fri, Nov 15 2019, 11:00 PM (EST)''' Niagara and most of the main systems are now available. <br />
</p><p> '''Fri, Nov 15 2019, 7:50 PM (EST)''' SOSCIP GPU cluster is up and accessible. Work on the other systems continues.<br />
</p><p> '''Fri, Nov 15 2019, 5:00 PM (EST)''' Infrastructure maintenance done, upgrades still in process.<br />
</p><p><br />
'''Fri, Nov 15 2019, 7:00 AM (EST)''' Maintenance shutdown of the SciNet data centre has started. Note: scratch purging has been postponed until Nov 17.<br/> <br />
</p><br />
<p><br />
'''Announcement:''' <br />
The SciNet datacentre will undergo a maintenance shutdown on<br />
Friday November 15th 2019, from 7 am to 11 pm (EST), with no access<br />
to any of the SciNet systems (Niagara, P8, SGC, HPSS, Teach cluster,<br />
or the filesystems) during that time. <br />
<br />
<br />
'''Sat, Nov 2 2019, 1:30 PM (update):''' Chiller has been fixed, all systems are operational. <br />
</p><br />
'''Fri, Nov 1 2019, 4:30 PM (update):''' We are operating in free cooling so have brought up about 1/2 of the Niagara compute nodes to reduce the cooling load. Access, storage, and other systems should now be available. <br />
<br />
'''Fri, Nov 1 2019, 12:05 PM (update):''' A power module in the chiller has failed and needs to be replaced. We should be able to operate in free cooling if the temperature stays cold enough, but we may not be able to run all systems. No eta yet on when users will be able to log back in. <br />
<br />
'''Fri, Nov 1 2019, 9:15 AM (update):''' There was an automated shutdown because of rising temperatures, causing all systems to go down. We are investigating; check here for updates.<br />
<br />
<p>'''Fri, Nov 1 2019, 8:16 AM:''' Unexpected data centre issue: Check here for updates.<br />
</p><br />
<br />
''' Thu 1 Aug 2019 5:00:00 PM ''' Systems are up and operational. <br />
<br />
'''Thu 1 Aug 2019 7:00:00 AM: ''' Scheduled Downtime Maintenance of the SciNet Datacenter. All systems will be down and unavailable starting 7am until the evening. <br />
<br />
'''Fri 26 Jul 2019, 16:02:26 EDT:''' There was an issue with the Burst Buffer at around 3PM, and it was recently solved. BB is OK again.<br />
<br />
''' Sun 30 Jun 2019 ''' The '''SOSCIP BGQ''' and '''P7''' systems were decommissioned on '''June 30th, 2019'''. The BGQdev front end node and storage are still available. <br />
<br />
'''Wed 19 Jun 2019, 1:20:00 PM:''' The BGQ is back online.<br />
<br />
'''Wed 19 Jun 2019, 10:00:00 AM:''' The BGQ is still down; the SOSCIP GPU nodes should be back up. <br />
<br />
'''Wed 19 Jun 2019, 1:40:00 AM:''' There was an issue with the SOSCIP BGQ and GPU Cluster last night at about 1:42 am, probably a power fluctuation that took them down. <br />
<br />
'''Wed 12 Jun 2019, 3:30 AM - 7:40 AM''' Intermittent system issues on Niagara's project and scratch as the file number limit was reached. We increased the number of files allowed in total on the file system. <br />
<br />
'''Thu 30 May 2019, 11:00:00 PM:'''<br />
The maintenance downtime of SciNet's data center has finished, and systems are being brought online now. You can check the progress here. Some systems might not be available until Friday morning.<br/><br />
Some action on the part of users will be required when they first connect again to a Niagara login node or datamover. This is due to the security upgrade of the Niagara cluster, which is now in line with currently accepted best practices.<br/><br />
The details of the required actions can be found on the [[SSH Changes in May 2019]] wiki page.<br />
<br />
'''Wed 29-30 May 2019''' The SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
'''SCHEDULED SHUTDOWN''': <br />
<br />
Please be advised that on '''Wednesday May 29th through Thursday May 30th''', the SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
This is necessary to finish the installation of an emergency power generator, to perform the annual cooling tower maintenance, and to enhance login security.<br />
<br />
We expect to be able to bring the systems back online the evening of May 30th. Due to the enhanced login security, users' ssh clients will need to update their known-hosts lists. More detailed information on this procedure will be sent shortly before the systems are back online.<br />
<br />
'''Fri 5 Apr 2019:''' Software updates on Niagara: The default CCEnv software stack now uses avx512 on Niagara, and there is now a NiaEnv/2019b stack ("epoch"). <br />
<br />
'''Thu 4 Apr 2019:''' The 2019 compute and storage allocations have taken effect on Niagara.<br />
<br />
'''NOTE''': There is scheduled network maintenance for '''Friday April 26th 12am-8am''' on the SciNet datacentre external network connection. This will not affect internal connections or running jobs; however, remote connections may see interruptions during this period.<br />
<br />
'''Wed 24 Apr 2019 14:14 EDT:''' HPSS is back on service. Library and robot arm maintenance finished.<br />
<br />
'''Wed 24 Apr 2019 08:35 EDT:''' HPSS out of service this morning for library and robot arm maintenance.<br />
<br />
'''Fri 19 Apr 2019 17:40 EDT:''' HPSS robot arm has been released and is back to normal operations.<br />
<br />
'''Fri 19 Apr 2019 14:00 EDT:''' Problems with the HPSS library robot have been detected.<br />
<br />
'''Wed 17 Apr 2019 15:35 EDT:''' Network connection is back.<br />
<br />
'''Wed 17 Apr 2019 15:12 EDT:''' Network connection down. Investigating.<br />
<br />
'''Tue 9 Apr 2019 22:24:14 EDT:''' Network connection restored.<br />
<br />
'''Tue 9 Apr 2019, 15:20:''' Network connection down. Investigating.<br />
<br />
'''Fri 5 Apr 2019:''' Planned short outage in connectivity to the SciNet datacentre from 7:30 am to 8:55 am EST for network maintenance. This outage will not affect running or queued jobs. It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.<br />
<br />
'''April 4, 2019:''' The 2019 compute and storage allocations will take effect on Niagara. Running jobs will not be affected by this change and will run their course. Queued jobs' priorities will be updated to reflect the new fairshare values later in the day. The queue should fully reflect the new fairshare values in about 24 hours. <br />
<br />
It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.<br />
<br />
There will be updates to the software stack on this day as well.<br />
<br />
'''March 25, 3:05 PM EST:''' Most systems back online, other services should be back shortly. <br />
<br />
'''March 25, 12:05 PM EST:''' Power is back at the datacentre, but it is not yet known when all systems will be back up. Keep checking here for updates.<br />
<br />
'''March 25, 11:27 AM EST:''' A power outage occurred in the datacentre and caused all services to go down. Check here for updates.<br />
<br />
'''Thu Mar 21 10:37:28 EDT 2019:''' HPSS is back in service<br />
<br />
HPSS out of service on '''Tue, Mar/19 at 9AM''', for tape library expansion and relocation. It's possible the downtime will extend to Wed, Mar/20.<br />
<br />
'''January 21, 4:00 PM''': HPSS is back in service. Thank you for your patience.<br />
<br />
'''January 18, 5:00 PM''': We did practically all of the HPSS upgrades (software/hardware); however, the main client node - archive02 - is presenting an issue we just couldn't resolve yet. We will try to resume work over the weekend with cool heads, or on Monday. Sorry, but this is an unforeseen delay. Jobs on the queue will remain there, and we'll delay the scratch purging by 1 week. <br><br />
<br />
'''January 16, 11:00 PM''': HPSS is being upgraded, as announced. <br><br />
<br />
'''January 16, 8:00 PM''': Systems are coming back up and should be accessible to users now.<br><br />
<br />
'''January 15, 8:00 AM''': Data centre downtime in effect.<br><br />
<br />
<font color=red><b>Downtime Announcement for January 15 and 16, 2019</b></font><br><br />
The SciNet datacentre will need to undergo a two-day maintenance shutdown in order to perform electrical work, repairs, and maintenance. The electrical work is in preparation for the upcoming installation of an emergency power generator and a larger UPS, which will result in increased resilience to power glitches and outages. The shutdown is scheduled to start on '''Tuesday January 15, 2019, at 7 am''' and will last until '''Wednesday January 16, 2019''', some time in the evening. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the filesystems) during this time.<br />
Check back here for up-to-date information on the status of the systems.<br />
<br />
Note: this downtime was originally scheduled for Dec. 18, 2018, but has been postponed and combined with the annual maintenance downtime.<br />
<br />
'''December 24, 2018, 11:35 AM EST:''' Most systems are operational again. If you had compute jobs running yesterday at around 3:30PM, they likely crashed - please check them and resubmit if needed.<br />
<br />
'''December 24, 2018, 10:40 AM EST:''' Repairs have been made, and the file systems are starting to be mounted on the cluster. <br />
<br />
'''December 23, 2018, 3:38 PM EST:''' Issues with the file systems (home, scratch and project). We are investigating, it looks like a hardware issue that we are trying to work around. Note that the absence of /home means you cannot log in with ssh keys. All compute jobs crashed around 3:30 PM EST on Dec 23. Once the system is properly up again, please resubmit your jobs. Unfortunately, at this time of year, it is not possible to give an estimate on when the system will be operational again.<br />
<br />
'''Thu Nov 22 14:20:00 EST 2018''': <font color=green>HPSS back in service</font><br />
<br />
'''Thu Nov 22 08:55:00 EST 2018''': <font color=red>HPSS offline for scheduled maintenance</font><br />
<br />
'''Tue Nov 20 16:30:00 EST 2018''': HPSS offline on Thursday 9AM for installation of new LTO8 drives in the tape library.<br />
<br />
'''Tue Oct 9 12:16:00 EDT 2018''': BGQ compute nodes are up. <br />
<br />
'''Sun Oct 7 20:24:26 EDT 2018''': SGC and BGQ front end are available, BGQ compute nodes down related to a cooling issue. <br />
<br />
'''Sat Oct 6 23:16:44 EDT 2018''': There were some problems bringing up SGC & BGQ, they will remain offline for now.<br />
<br />
'''Sat Oct 6 18:36:35 EDT 2018''': Electrical work finished, power restored. Systems are coming online.<br />
<br />
'''July 18, 2018:''' login.scinet.utoronto.ca is now disabled, GPC $SCRATCH and $HOME are decommissioned.<br />
<br />
'''July 12, 2018:''' There was a short power interruption around 10:30 am which caused most of the systems (Niagara, SGC, BGQ) to reboot and any running jobs to fail. <br />
<br />
'''July 11, 2018:''' P7's moved to BGQ filesystem, P8's moved to Niagara filesystem.<br />
<br />
'''May 24, 2018, 9:25 PM EST:''' The data center is up, and all systems are operational again.<br />
<br />
'''May 24, 2018, 7:00 AM EST:''' The data centre is under annual maintenance. All systems are offline. Systems are expected to be back late afternoon today; check for updates on this page.<br />
<br />
'''May 18, 2018:''' Announcement: Annual scheduled maintenance downtime: Thursday May 24, starting 7:00 AM<br />
<br />
'''May 16, 2018:''' Cooling restored, systems online<br />
<br />
'''May 16, 2018:''' Cooling issue at datacentre again, all systems down<br />
<br />
'''May 15, 2018:''' Cooling restored, systems coming online<br />
<br />
'''May 15, 2018''' Cooling issue at datacentre, all systems down<br />
<br />
'''May 4, 2018:''' [[HPSS]] is now operational on Niagara.<br />
<br />
'''May 3, 2018:''' [[Burst Buffer]] is available upon request.<br />
<br />
'''May 3, 2018:''' The [https://docs.computecanada.ca/wiki/Globus Globus] endpoint for Niagara is available: computecanada#niagara.<br />
<br />
'''May 1, 2018:''' System status moved here.<br />
<br />
'''Apr 23, 2018:''' GPC-compute is decommissioned, GPC-storage available until 30 May 2018.<br />
<br />
'''April 10, 2018:''' Niagara commissioned.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4167
Main Page
2022-09-16T13:19:07Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Fri Sept 16, 2022, 9:00 AM EDT:''' Login nodes are not accessible. We are investigating.<br />
<br />
'''Tue Sep 13, 2022, 11:00 AM EDT:''' Mist login node is available again.<br />
<br />
'''Tue Sep 13, 2022, 10:00 AM EDT:''' Mist login node is under maintenance and temporarily inaccessible to users.<br />
<br />
'''Fri Sep 2, 2022, 11:25 AM EDT:''' Rouge login node is back up.<br />
<br />
'''Fri Sep 2, 2022, 10:25 AM EDT:''' Issues with the Rouge login node; we are investigating.<br />
<br />
'''Tue Aug 23, 2022, 1:15 PM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Tue Aug 23, 2022, 1:00 PM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4116
Main Page
2022-08-11T13:26:27Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 9:20 AM EDT:''' The login node issues have been resolved.<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara ("ssh nia-login02", for example).<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4113
Main Page
2022-08-11T12:05:56Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Partial |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Partial |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara ("ssh nia-login02", for example).<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4110
Main Page
2022-08-11T12:01:14Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Partial |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Partial |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara.<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4107
Main Page
2022-08-11T12:00:54Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Partial |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Partial |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara.<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4104
Main Page
2022-08-11T11:58:47Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up", "Partial" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Partial |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara.<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4101
Main Page
2022-08-11T11:52:18Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
'''Thu Aug 11, 2022, 7:50 AM EDT:''' We are having problems accessing the Niagara login nodes. Until fixed, please log in to Mist and then ssh to a Niagara login node to access Niagara.<br />
<br />
'''Fri July 15, 2022, 10:50 AM EDT:''' Jupyter Hub is available again.<br />
<br />
'''Fri July 15, 2022, 10:30 AM EDT:''' Jupyter Hub is being updated and temporarily inaccessible to users.<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4065
Main Page
2022-06-17T13:00:35Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus |Globus}}<br />
|}<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=4062
Main Page
2022-06-17T12:59:38Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus |Globus}}<br />
|}<br />
<br />
'''Wed June 16, 2022, 3:45 PM EDT:''' File system is stable now. We're gradually opening the systems up.<br />
<br />
'''Wed June 16, 2022, 10:15 AM EDT:''' Emergency maintenance shutdown of filesystem. Running jobs will be affected.<br />
<br />
'''Wed June 15, 2022, 7:35 PM EDT:''' Maintenance shutdown finished. Most systems are available again.<br />
<br />
'''Wed June 15, 2022, 7:00 AM EDT:''' Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.<br />
<br />
<!-- When removing system status entries, please archive them to: --><br />
[[Previous messages]]<br />
<br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH#SSH Keys|SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=MATLAB&diff=3671
MATLAB
2022-03-25T14:25:23Z
<p>Ejspence: /* Using a MATLAB stand-alone executable on Niagara */</p>
<hr />
<div>We often get questions about running MATLAB on Niagara. With a few exceptions for compilers and debuggers, SciNet does not purchase licenses for commercial software. As such, SciNet does not have a license for MATLAB, nor will it in the future. If users wish to run MATLAB, they must supply their own license or explore alternative options. This page describes the options for getting your MATLAB code to run, in recommended order. <br />
<br />
<br />
== Not using MATLAB ==<br />
<br />
Users can attempt to run MATLAB code using the open-source program [[Octave]], accessible through the octave module. Though there are some differences between the two programs, Octave has been designed to interpret MATLAB code and can often be used in place of MATLAB. If your MATLAB code does not use the more specialized MATLAB toolboxes, you may be able to get away with using Octave instead. Be sure to test your implementation in Octave thoroughly before committing to this option.<br />
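As a rough sketch of this workflow (the exact octave module version available on your stack may differ), an existing MATLAB script could be tested with Octave like this:<br />
<source lang="bash"><br />
# Load Octave from the Niagara software stack (module version is an assumption)<br />
module load octave<br />
<br />
# Run an existing MATLAB script non-interactively with Octave<br />
octave --no-gui myscript.m<br />
</source><br />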
<br />
It is worth observing that, while convenient for prototyping and running on a single workstation, there are reasons to avoid using MATLAB for larger HPC/ARC projects. These include the prohibitive license cost for large-scale work, poor performance at scale, and portability issues. If you can switch to a license-free option, such as Python, it may be worth the effort.<br />
<br />
== Using stand-alone MATLAB executables ==<br />
<br />
=== Creating a MATLAB stand-alone executable ===<br />
<br />
If MATLAB must be used, you may be able to compile your MATLAB code into a stand-alone executable, and run this on a Niagara compute node. The version of MATLAB being used will require a compiler license, and the compilation must be done on a Linux machine (not Niagara). <br />
<br />
Note: Compiling a MATLAB code produces two files; e.g., compiling <tt>myscript.m</tt> will produce a file <tt>myscript</tt> and a file <tt>run_myscript.sh</tt>. You will need both files.<br />
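<br />
For reference, the compilation is typically done with the MATLAB compiler <tt>mcc</tt> on your own licensed Linux machine; a minimal sketch, assuming your entry point is <tt>myscript.m</tt>:<br />
<br />
<source lang="bash"><br />
# On your own Linux machine, with MATLAB and the MATLAB Compiler licensed and installed:<br />
# -m builds a stand-alone executable from the given entry-point script or function.<br />
mcc -m myscript.m<br />
# This produces the executable 'myscript' and the wrapper 'run_myscript.sh';<br />
# copy both to Niagara.<br />
</source><br />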
<br />
=== Using a MATLAB stand-alone executable on Niagara ===<br />
<br />
Once the compilation is done, the executable and its run script (<tt>run_SOMETHING.sh</tt>) can be copied to SciNet, and run using the MATLAB Compiler Runtime (MCR), which can be accessed using the MCR module. The MCR version must match the version of MATLAB used to compile the code. If the version of MCR that you need is not listed among the MCR module versions, contact us and we will install the version you require.<br />
<br />
Here is an example script which uses the MCR. Note that the "run_myscript.sh" script is produced by the MATLAB compiler, together with the "myscript" executable (assuming you were working on the myscript.m MATLAB code):<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --cpus-per-task=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name test_matlab<br />
#SBATCH --output=matlab_output_%j.txt<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# load module<br />
module load mcr/R2018a<br />
<br />
# Directory for the MCR to use to write temporary files. Use whatever directory you wish.<br />
# We specify a unique directory in case multiple jobs are running simultaneously.<br />
mkdir -p $SCRATCH/temp/$SLURM_JOB_ID<br />
export MCR_CACHE_ROOT=$SCRATCH/temp/$SLURM_JOB_ID<br />
<br />
# EXECUTION COMMAND (note that the MATLAB script may require that LD_LIBRARY_PATH be added<br />
# to the script arguments). Note that, if the calculations are serial, you must bundle 40 such<br />
# calculations together for production runs!<br />
./run_myscript.sh $MATLAB:$LD_LIBRARY_PATH<br />
<br />
rm -rf $SCRATCH/temp/$SLURM_JOB_ID<br />
</source><br />
<br />
=== Available MATLAB runtime versions ===<br />
<br />
Below is a list of the available MATLAB runtime versions on Niagara, with the required module command:<br />
<br />
module load NiaEnv/2018a mcr/R2018a<br />
module load NiaEnv/2019b mcr/R2019a<br />
module load CCEnv nixpkgs/16.09 mcr/R2013a<br />
module load CCEnv nixpkgs/16.09 mcr/R2013b<br />
module load CCEnv nixpkgs/16.09 mcr/R2014a<br />
module load CCEnv nixpkgs/16.09 mcr/R2014b<br />
module load CCEnv nixpkgs/16.09 mcr/R2015a<br />
module load CCEnv nixpkgs/16.09 mcr/R2015b<br />
module load CCEnv nixpkgs/16.09 mcr/R2016a<br />
module load CCEnv nixpkgs/16.09 mcr/R2016b<br />
module load CCEnv nixpkgs/16.09 mcr/R2017a<br />
module load CCEnv nixpkgs/16.09 mcr/R2017b<br />
module load CCEnv nixpkgs/16.09 mcr/R2018a<br />
module load CCEnv nixpkgs/16.09 mcr/R2018b<br />
module load CCEnv nixpkgs/16.09 mcr/R2019a<br />
module load CCEnv nixpkgs/16.09 mcr/R2019b<br />
module load CCEnv gentoo/2020 mcr/R2020b<br />
<br />
== Tunneling to a license server ==<br />
<br />
If you have access to a non-SciNet MATLAB license server, and have installed MATLAB in your $HOME directory, you can [[SSH_Tunneling|set up your submission script]] to access the external license server. The following lines should be added to the beginning of your submission script, after the #SBATCH commands:<br />
<br />
<pre><br />
PORT=XXX # port number of the license server<br />
LICENSE_IP=AAA.BBB.CCC.DDD # IP address of the license server<br />
ssh nia-gw -L${PORT}:${LICENSE_IP}:${PORT} -N &<br />
</pre><br />
<br />
The last line tunnels the port from the compute node back to the license server, through nia-gw. The port number and IP address of the license server must be supplied by the system administrator of the license server.<br />
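<br />
Once the tunnel is running, MATLAB must be told to check out its license from the local end of the tunnel. One way to do this (a sketch; the exact mechanism depends on how your MATLAB installation is configured) is to point the standard FlexLM environment variable at the forwarded port, after the tunnel lines in your submission script:<br />
<pre><br />
# Hypothetical example: direct MATLAB's license checkout through the SSH tunnel<br />
export MLM_LICENSE_FILE=${PORT}@localhost<br />
</pre><br />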
<br />
== Using a different Consortium ==<br />
<br />
Both [https://www.sharcnet.ca/my/software/show/54 Sharcnet] and [https://www.westgrid.ca/support/software/matlab Westgrid] have purchased different types of MATLAB licenses. Users can contact those consortia if they wish to attempt to run on those systems.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Mist&diff=3638
Mist
2022-03-09T14:55:33Z
<p>Ejspence: /* Limits */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Mist.jpg|center|300px|thumb]]<br />
|name=Mist<br />
|installed=Dec 2019<br />
|operatingsystem= Red Hat Enterprise Linux 8.2<br />
|loginnode= mist.scinet.utoronto.ca<br />
|nnodes= 54 IBM AC922<br />
|rampernode= 256 GB <br />
|gpuspernode=4 V100-SXM2-32GB<br />
|interconnect=Mellanox EDR<br />
|vendorcompilers= NVCC, IBM XL<br />
|queuetype=Slurm<br />
}}<br />
<br />
=Specifications=<br />
Mist is a SciNet-[[#SOSCIP Users |SOSCIP]] joint GPU cluster consisting of 54 IBM AC922 servers. Each node of the cluster has 32 IBM Power9 cores, 256GB of RAM and 4 NVIDIA V100-SXM2-32GB GPUs connected by NVLink. The cluster has an InfiniBand EDR interconnect providing GPU-Direct RDMA capability.<br />
<br />
'''<span style="background:#fc8383">Important note:</span>''' the majority of computer systems as of 2021 (laptops, desktops, and HPC systems) use the 64-bit x86 instruction set architecture (ISA) in microprocessors produced by Intel and AMD. This ISA is incompatible with Mist, whose hardware uses the 64-bit PPC ISA (in little-endian mode). In practical terms, x86-compiled binaries (executables and libraries) cannot run on Mist. For this reason, the Niagara and Compute Canada software stacks (modules) cannot be made available on Mist, and closed-source software can only be used when the vendor provides a compatible version of their application. '''Python applications''' almost always rely on bindings to libraries originally written in C or C++, and some of these are not available on PyPI or the various Conda channels as precompiled binaries compatible with Mist. The recommended way to use Python on Mist is to create a [[#Anaconda (Python)|Conda]] environment and install packages from the anaconda (default) channel, where most popular packages have a linux-ppc64le (Mist-compatible) version available. Some popular machine learning packages should be installed from the internal [[#Open-CE|Open-CE]] channel. Where a compatible Conda package cannot be found, installing from PyPI (<code>pip install</code>) can be attempted. Pip will try to compile the package’s source code if no compatible precompiled wheel is available, so a compiler module (such as <code>gcc/.core</code>) should be loaded in advance. Some packages require tweaking of the source code or build procedure to compile successfully on Mist; please contact [[#Support|support]] if you need assistance.<br />
<br />
= Getting started on Mist =<br />
As of January 22, 2022, authentication is only allowed via SSH keys. [https://docs.computecanada.ca/wiki/SSH_Keys Please refer to this page] to generate your SSH key pair and make sure you use it securely.<br />
<br />
Mist can be accessed directly:<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@mist.scinet.utoronto.ca<br />
</pre><br />
The Mist login node '''mist-login01''' can also be accessed via the Niagara cluster.<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y mist-login01<br />
</pre><br />
== Storage ==<br />
The file system for Mist is shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Mist: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]] and a list of [[Modules for Mist]] is also available.<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as the package's include and lib directories.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
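<br />
For example (a minimal sketch; the root-variable name follows the pattern just described, e.g. <code>SCINET_CUDA_ROOT</code> for the <code>cuda</code> module):<br />
<pre><br />
ml cuda/11.0.3 gcc/8.5.0    # same as: module load cuda/11.0.3 gcc/8.5.0<br />
ml                          # same as: module list<br />
echo $SCINET_CUDA_ROOT      # root directory of the loaded cuda module<br />
</pre><br />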
== Tips for loading software ==<br />
<br />
* We advise '''''against''''' loading modules in your .bashrc. This can lead to very confusing behaviour under certain circumstances. Our guidelines for .bashrc files can be found [[bashrc guidelines|here]].<br />
* Instead, load modules by hand when needed, or by sourcing a separate script.<br />
* Load run-specific modules inside your job submission script.<br />
* Short names give default versions; e.g. <code>cuda</code> → <code>cuda/11.0.3</code>. It is usually better to be explicit about the versions, for future reproducibility.<br />
* Modules often require other modules to be loaded first. Solve these dependencies by using [[Using_modules#Module_spider | <code>module spider</code>]].<br />
<br />
= Available compilers and interpreters =<br />
* The <tt>cuda</tt> module has to be loaded first for GPU software.<br />
* For most compiled software, one should use the GNU compilers (<tt>gcc</tt> for C, <tt>g++</tt> for C++, and <tt>gfortran</tt> for Fortran). Loading the <tt>gcc</tt> module makes these available. <br />
* The IBM XL compiler suite (<tt>xlc_r, xlc++_r, xlf_r</tt>) is also available, if you load one of the <tt>xl</tt> modules.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> or <tt>spectrum-mpi</tt> module (see the example after this list).<br />
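<br />
The following sketch illustrates the typical pattern for compiling plain C, MPI, and CUDA codes (module versions and file names are illustrative only; use <code>module spider</code> to check which module combinations are actually available, and start from a clean environment, e.g. after <code>module purge</code>, for each build):<br />
<br />
<pre><br />
# Plain C code with the GNU compiler (no CUDA needed)<br />
module load gcc/10.3.0<br />
gcc -O2 hello.c -o hello<br />
<br />
# MPI code: load an MPI module on top of the compiler<br />
module load gcc/10.3.0 openmpi<br />
mpicc -O2 hello_mpi.c -o hello_mpi<br />
<br />
# CUDA code: load cuda first, then a host compiler<br />
module load cuda/11.0.3 gcc/8.5.0<br />
nvcc -O2 hello.cu -o hello_cuda<br />
</pre><br />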
<br />
=== CUDA ===<br />
<br />
The currently installed CUDA Toolkits are '''11.0.3''' and '''10.2.2 (10.2.89)'''<br />
<pre><br />
module load cuda/11.0.3<br />
module load cuda/10.2.2<br />
</pre><br />
*A compiler (GCC, XL or NVHPC/PGI) module must be loaded in order to use CUDA to build any code.<br />
The current NVIDIA driver version is 450.119.04.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/9.3.0 (must load CUDA 11)<br />
gcc/8.5.0 (must load CUDA 10 or 11)<br />
gcc/10.3.0 (w/o CUDA)<br />
</pre><br />
<br />
=== IBM XL Compilers ===<br />
<br />
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run<br />
<br />
<pre><br />
module load xl/16.1.1.10<br />
</pre><br />
<br />
IBM XL Compilers are enabled for use with NVIDIA GPUs, including support for OpenMP GPU offloading and integration with NVIDIA's nvcc command to compile host-side code for the POWER9 CPU. Information about the IBM XL Compilers can be found at the following links:[https://www.ibm.com/support/knowledgecenter/SSXVZZ_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL C/C++], <br />
[https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]<br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers, including GCC and XL. The <tt>spectrum-mpi/<version></tt> module provides IBM Spectrum MPI.<br />
<br />
=== NVHPC/PGI ===<br />
The PGI compiler is provided as part of NVHPC (the NVIDIA HPC SDK).<br />
<pre><br />
module load nvhpc/21.3<br />
</pre><br />
<br />
= Software =<br />
== Amber20 ==<br />
<br />
Users who hold an Amber20 license can build Amber20 from source and run it on Mist. '''SOSCIP/SciNet does not provide an Amber license or source code.'''<br />
<br />
=== Building Amber20 ===<br />
Modules that are needed for building Amber20:<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05 cmake/3.19.8<br />
</pre><br />
Cmake configuration:<br />
<pre><br />
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/where-amber-install -DCOMPILER=GNU -DMPI=FALSE -DCUDA=TRUE -DINSTALL_TESTS=TRUE -DDOWNLOAD_MINICONDA=FALSE -DOPENMP=TRUE -DNCCL=FALSE -DAPPLY_UPDATES=TRUE<br />
</pre><br />
<br />
=== Running Amber20 ===<br />
'''Amber20 does not scale beyond a single GPU on NVIDIA Pascal P100 and later GPUs such as the V100.''' It is highly suggested to run Amber20 as a single-GPU job.<br />
A job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP-project-ID><br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05<br />
export PATH=$HOME/where-amber-install/bin:$PATH<br />
export LD_LIBRARY_PATH=$HOME/where-amber-install/lib:$LD_LIBRARY_PATH<br />
pmemd.cuda .... <parameters> ...<br />
</pre><br />
<br />
== Anaconda (Python) ==<br />
Anaconda is a popular distribution of the Python programming language. It contains several common Python libraries, such as SciPy and NumPy, as pre-built packages, which eases installation. Anaconda is provided through the '''anaconda3''' modules.<br />
<br />
To install packages locally, users need to load the module and create a conda environment:<br />
<pre><br />
module load anaconda3<br />
conda create -n myPythonEnv python=3.8<br />
</pre><br />
*Note: By default, conda environments are located in '''$HOME/.conda/envs'''. Cache (downloaded tarballs and packages) is under '''$HOME/.conda/pkgs'''. Users may run into disk-quota problems if too many environments are created. To clean the conda cache, '''please run "conda clean -y --all" and "rm -rf $HOME/.conda/pkgs/*" after installing packages'''.<br />
<br />
To activate the conda environment: (should be activated before running python)<br />
<pre><br />
source activate myPythonEnv<br />
</pre><br />
Note that you SHOULD NOT use '''conda activate myPythonEnv''' to activate the environment. This leads to all sorts of problems. Once the environment is activated, users can update or install packages via '''conda''' or '''pip''':<br />
<pre><br />
conda install <package_name> (preferred way to install packages)<br />
pip install <package_name><br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
To deactivate:<br />
<pre><br />
source deactivate<br />
</pre><br />
To remove a conda environment:<br />
<pre><br />
conda remove --name myPythonEnv --all<br />
</pre><br />
To verify that the environment was removed, run:<br />
<pre><br />
conda info --envs<br />
</pre><br />
<br />
=== Submitting Python Job ===<br />
A single-gpu job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate myPythonEnv<br />
python code.py ...<br />
</pre><br />
<br />
== CuPy ==<br />
[https://cupy.chainer.org CuPy] is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRand, cuSolver, cuSPARSE, cuFFT and NCCL to make full use of the GPU architecture. CuPy is an implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it. It supports a subset of numpy.ndarray interface.<br />
<br />
CuPy can be installed into any conda environment. The Python packages numpy, six and fastrlock are required; cuDNN and NCCL are optional.<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 nccl/2.9.9 anaconda3/2021.05<br />
conda create -n cupy-env python=3.8 numpy six fastrlock<br />
source activate cupy-env<br />
CFLAGS="-I$MODULE_CUDNN_PREFIX/include -I$MODULE_NCCL_PREFIX/include -I$MODULE_CUDA_PREFIX/include" LDFLAGS="-L$MODULE_CUDNN_PREFIX/lib64 -L$MODULE_NCCL_PREFIX/lib" CUDA_PATH=$MODULE_CUDA_PREFIX pip install cupy<br />
#building/installing CuPy will take a few minutes<br />
</pre><br />
<br />
== Gromacs ==<br />
[http://www.gromacs.org/ GROMACS] is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.<br />
*'''GROMACS 2019'''<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
</pre><br />
*'''GROMACS 2020 and 2021''' Thread-MPI version supports full GPU enablement of all key computational sections. The GPU is used throughout the timestep and repeated CPU-GPU transfers are eliminated. Users are suggested to carefully verify the results.<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.4<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.6<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.2<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 openmpi/4.1.1+ucx-1.10.0 gromacs/2021.2 (testing purpose only)<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.4<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
</pre><br />
=== Small/Medium Simulation ===<br />
Due to the lack of PME domain decomposition support on GPU, Gromacs uses the CPU to calculate PME when using multiple GPUs. '''It is always recommended to use a single GPU for small and medium sized simulations with Gromacs.''' By using only 1 tMPI thread (with multiple OpenMP threads) on a single GPU, both the non-bonded PP and PME workloads are automatically offloaded to the GPU when possible.<br />
* Gromacs 2019 example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
* Gromacs 2020 or 2021 example: <br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
export GMX_FORCE_UPDATE_DEFAULT_GPU=true<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
=== Large Simulation ===<br />
If the memory available to a single-GPU job (~58GB) is not sufficient for the simulation, multiple GPUs can be used. It is suggested to start testing with one full node (4 GPUs) and force PME onto the GPU. Multiple PME ranks are not supported with PME on GPU, so if the GPU is used for the PME calculation, -npme (the number of PME ranks) must be set to 1. If PME has less work than PP, it is suggested to run multiple ranks per GPU, so that the GPU handling the PME rank can also do some work on the PP rank(s).<br />
'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not occupy 3 additional GPUs for only a small performance improvement.'''<br />
*An example using 4 GPUs, 7 PP ranks/tmpi threads + 1 PME rank/tmpi thread: ('''-pin on -pme gpu -npme 1''' must be added to mdrun command in order to force GPU to do PME)<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on -pme gpu -npme 1 ... <add your parameters><br />
</pre><br />
*It is suggested to also test using '''-ntmpi 4''' and '''export OMP_NUM_THREADS=8''' if you receive a NOTE in Gromacs output saying "% performance was lost because the PME ranks had more work to do than the PP ranks". In this case, NVIDIA MPS is not needed since there is only one MPI rank per GPU.<br />
*'''Please note that the solving of PME on GPU is still only the initial version supporting this behaviour, and comes with a set of limitations outlined further below.'''<br />
<pre><br />
* Only a PME order of 4 is supported on GPUs.<br />
* PME will run on a GPU only when exactly one rank has a PME task, ie. decompositions with multiple ranks doing PME are not supported.<br />
* Only single precision is supported.<br />
* Free energy calculations where charges are perturbed are not supported, because only single PME grids can be calculated.<br />
* Only dynamical integrators are supported (ie. leap-frog, Velocity Verlet, stochastic dynamics)<br />
* LJ PME is not supported on GPUs.<br />
</pre><br />
*An example using 4 GPUs, '''PME on CPU''': ('''-pin on''' must be added to mdrun command for proper CPU thread bindings)<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on ... <add your parameters><br />
<br />
# "-ntmpi 16, OMP_NUM_THREADS=2" and "-ntmpi 4, OMP_NUM_THREADS=8" should also be tested. <br />
# num_thread_MPI_ranks(-ntmpi) * num_OpenMP_threads = 32<br />
</pre><br />
*'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not occupy 3 additional GPUs for only a small performance improvement.'''<br />
*'''NOTE: The above examples will NOT work with multiple nodes. If simulation is too large for a single GPU node, please contact SciNet/SOSCIP support.'''<br />
<br />
== NAMD ==<br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.<br />
=== 2.14 ===<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
</pre><br />
==== Running with single GPU ====<br />
If you have many jobs to run, it is suggested to use a single GPU per job. This makes jobs easier to schedule and gives better overall throughput.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --nodes=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -bind-to none -hostfile nodelist-$SLURM_JOB_ID `which namd2` +idlepoll +ppn 8 +p 8 stmv.namd<br />
</pre><br />
<br />
==== Running with one process per node (4 GPUs)====<br />
An example of the job script (using 1 node, '''one process per node''', 32 CPU threads per process + 4 GPUs per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 32 +p $((32*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
==== Running with one process per GPU (4 GPUs)====<br />
NAMD may scale better if using '''one process per GPU'''. Please do your own benchmark.<br />
An example of the job script (using 1 node, '''one process per GPU''', 8 CPU threads per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 4 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 8 +p $((8*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
<br />
== Open-CE ==<br />
[https://github.com/open-ce/open-ce Open-CE] is an '''IBM''' repository for feedstock collection, environment data, and scripts for building TensorFlow, PyTorch, and other machine learning packages and dependencies. Open-CE is distributed as a '''conda channel''' on the Mist cluster.<br />
'''Available packages and versions are listed here: [https://github.com/open-ce/open-ce/releases/tag/open-ce-v1.5.2 Open-CE Releases]'''. Currently only Python 3.8 and CUDA 11.2 are supported. If you need a different Python or CUDA version, please contact SOSCIP or SciNet support.<br />
<br />
*Packages can be installed by setting Open-CE conda channel:<br />
<pre><br />
conda install -c /scinet/mist/ibm/open-ce python=3.8 cudatoolkit=11.2 PACKAGE<br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
{| class="wikitable"<br />
|+Available Packages:<br />
|-<br />
|Tensorflow<br />
|TensorFlow Estimators<br />
|TensorFlow Probability<br />
|TensorBoard<br />
|TensorBoard Data Server<br />
|TensorFlow Text<br />
|TensorFlow Model Optimizations<br />
|TensorFlow Addons<br />
|TensorFlow Datasets<br />
|TensorFlow Hub<br />
|-<br />
|TensorFlow MetaData<br />
|PyTorch<br />
|TorchText<br />
|TorchVision<br />
|PyTorch Lightning<br />
|PyTorch Lightning Bolts<br />
|ONNX<br />
|Onnx-runtime<br />
|skl2onnx<br />
|tf2onnx<br />
|-<br />
|onnxmltools<br />
|onnxconverter-common<br />
|XGBoost<br />
|LightGBM<br />
|Transformers<br />
|Tokenizers<br />
|SentencePiece<br />
|Spacy<br />
|DALI<br />
|OpenCV<br />
|-<br />
|Horovod<br />
|PyArrow<br />
|grpc<br />
|uwsgi<br />
|ORC<br />
|Mamba<br />
|}<br />
<br />
== PyTorch ==<br />
=== Installing from IBM Open-CE Conda Channel ===<br />
The easiest way to install PyTorch on Mist is from IBM's Open-CE Conda channel. Prepare a conda environment and install PyTorch from that channel:<br />
<pre><br />
module load anaconda3<br />
conda create -n pytorch_env python=3.8<br />
source activate pytorch_env<br />
conda install -c /scinet/mist/ibm/open-ce pytorch=1.10.2 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 pytorch=1.7.1 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
Add the command below to your job script, before the python command, to get deterministic results; see details here: [https://github.com/pytorch/pytorch/issues/39849]<br />
<pre><br />
export CUBLAS_WORKSPACE_CONFIG=:4096:2<br />
</pre><br />
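<br />
To quickly verify that the installation works and can see a GPU (a minimal sanity check, assuming the <tt>pytorch_env</tt> environment is active; run it in a [[#Testing_and_debugging|debug job]] rather than loading down the login node):<br />
<pre><br />
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"<br />
</pre><br />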
<br />
== RAPIDS ==<br />
[https://rapids.ai RAPIDS] is a suite of open-source software libraries that lets you execute end-to-end data science and analytics pipelines entirely on GPUs. The RAPIDS data science framework includes a collection of libraries: '''cuDF (GPU DataFrames)''', '''cuML (GPU machine learning algorithms)''', '''cuStrings (GPU string manipulation)''', etc.<br />
<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install RAPIDS on Mist is from IBM's Conda channel. Prepare a conda environment with Python 3.6 or 3.7 and install powerai-rapids from that channel:<br />
<pre><br />
module load anaconda3<br />
conda create -n rapids_env python=3.7<br />
source activate rapids_env<br />
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/ powerai-rapids<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
== TensorFlow and Keras ==<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install TensorFlow and Keras on Mist is from IBM's Open-CE Conda channel. Prepare a conda environment and install TensorFlow from that channel:<br />
<pre><br />
module load anaconda3<br />
conda create -n tf_env python=3.8<br />
source activate tf_env<br />
conda install -c /scinet/mist/ibm/open-ce tensorflow==2.7.1 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 tensorflow==2.4.3 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
= Testing and debugging =<br />
You should test your code before submitting it to the cluster, both to verify that it is correct and to determine what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, take at most about 1-2 GB of memory, and use no more than one GPU and a few cores.<br />
<!-- * You can run the [[Parallel Debugging with DDT|DDT]] debugger on the login nodes after <code>module load ddt</code>. --><br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the <tt>debugjob</tt> command:<br />
mist-login01:~$ debugjob --clean -g G<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a single node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you 2 nodes, each with 4 GPUs, for 15 minutes. The <tt>--clean</tt> argument is optional but recommended, as it starts the session without any modules loaded, thus mimicking more closely what happens when you submit a job script. Users need to load modules and activate their conda environment after the debug job starts. It is recommended to run 'conda clean' before 'source activate ENV' in a debug job if the --clean flag was omitted.<br />
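<br />
For example, a typical debug session for a Python workflow might look like this (a sketch, reusing the <tt>myPythonEnv</tt> environment from the [[#Anaconda (Python)|Anaconda]] section):<br />
<pre><br />
mist-login01:~$ debugjob --clean -g 1<br />
# ... wait for the interactive shell on a compute node to start, then:<br />
module load anaconda3<br />
source activate myPythonEnv<br />
python code.py<br />
</pre><br />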
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Mist login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on some of Mist's 53 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Mist uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (4 GPUs per node)'''.<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
mist-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by single GPU or by full node, so request either 1 GPU or 4 GPUs per node.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below). <br />
== SOSCIP Users ==<br />
*[https://www.soscip.org SOSCIP] is a consortium to bring together industrial partners and academic researchers and provide them with sophisticated advanced computing technologies and expertise to solve social, technical and business challenges across sectors and drive economic growth.<br />
<br />
If you are working on a SOSCIP project, please contact [mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca] to have your user account added to the SOSCIP project accounts. SOSCIP users need to submit jobs with an additional SLURM flag to get higher priority:<br />
<pre><br />
#SBATCH -A soscip-<SOSCIP_PROJECT_ID> #e.g. soscip-3-001<br />
OR<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID><br />
</pre><br />
<br />
== Single-GPU job script ==<br />
For a single-GPU job, each job gets a quarter of a node, i.e. 1 GPU + 8 CPU cores (32 hardware threads) + ~58GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, you can set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
*It is suggested to use the NVIDIA Multi-Process Service (MPS) if running multiple MPI ranks on one GPU (see the sketch after the job-script example below).<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate conda_env<br />
python code.py ...<br />
</pre><br />
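<br />
If you do run multiple MPI ranks on a single GPU, the MPS daemon can be started inside the job script before the MPI launch. A minimal sketch (the directories and the <tt>mpirun</tt> line are placeholders, not a tested recipe):<br />
<pre><br />
# Start the NVIDIA Multi-Process Service so several MPI ranks can share the GPU<br />
export CUDA_MPS_PIPE_DIRECTORY=$SCRATCH/mps/$SLURM_JOB_ID<br />
export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps/$SLURM_JOB_ID<br />
mkdir -p $CUDA_MPS_PIPE_DIRECTORY<br />
nvidia-cuda-mps-control -d           # start the MPS daemon<br />
<br />
mpirun -np 4 ./my_mpi_gpu_app        # the ranks now share the single allocated GPU<br />
<br />
echo quit | nvidia-cuda-mps-control  # shut the daemon down at the end of the job<br />
</pre><br />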
<br />
== Full-node job script ==<br />
'''If you are not sure whether your program can use multiple GPUs, please follow the single-GPU job instructions above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU jobs must request a minimum of one full node (4 GPUs). Users need to specify the "compute_full_node" partition in order to get all resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4 #this only affects MPI job<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued.<br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Running jobs<br />
!Submitted jobs (incl. running)<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 GPU (8&nbsp;cores) || default:&nbsp;4&nbsp;nodes&nbsp;(16&nbsp;GPUs) <br> with&nbsp;allocation:&nbsp;4&nbsp;nodes&nbsp;(16&nbsp;GPUs)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 GPU (8 cores) || 2 nodes (8 GPUs)|| N/A || 1 hour<br />
|-<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
= Jupyter Notebooks =<br />
SciNet’s [[Jupyter Hub]] is a Niagara-type node; it has a different CPU architecture and no GPUs, so Conda environments prepared on Mist will not work there properly. Users who need a Jupyter Notebook to develop and test some aspects of their workflow can create their own server on the Mist login node and use an SSH tunnel to connect to it from outside. Users who choose to do so have to keep in mind that the login node is a shared resource, and heavy calculations should be done only on compute nodes. Processes (including the iPython kernels used by the notebooks) are limited to one hour of total CPU time: idle time does not count toward this one hour, and use of multiple cores counts proportionally to the number of cores (e.g. a kernel using all 128 virtual cores on the node will be killed after about 28 seconds). Idle notebooks can still burden the node by hogging system and GPU memory; please be mindful of other users and terminate notebooks when your work is done.<br />
<br />
As an example, let us create a new Conda environment and activate it:<br />
<pre><br />
module load anaconda3<br />
conda create -n jupyter_env python=3.7<br />
source activate jupyter_env<br />
</pre><br />
Install the Jupyter Notebook server:<br />
<pre><br />
conda install notebook<br />
</pre><br />
<br />
== Running the notebook server ==<br />
When the Conda environment is active, enter:<br />
<pre><br />
jupyter-notebook<br />
</pre><br />
By default, the Jupyter Notebook server uses port 8888 (can be overridden with the <code>--port</code> option). If another user has already started their own server, the default port may be busy, in which case the server will be listening on a different port. Once launched, the server will output some information to the terminal that will include the actual port number used and a 48-character token. For example:<br />
<pre>http://localhost:8890/?token=54c4090d……</pre><br />
In this example, the server is listening on port 8890.<br />
<br />
== Creating a tunnel ==<br />
In order to access this port remotely (i.e. from your office or home), an [https://en.wikipedia.org/wiki/Tunneling_protocol#Secure_Shell_tunneling SSH tunnel] has to be established. Please refer to your SSH client’s documentation for instructions on how to do that. For the OpenSSH client (standard in most Linux distributions and macOS), a tunnel can be opened in a separate terminal session to the one where the Jupyter Notebook server is running. In the new terminal, issue this command:<br />
<pre><br />
ssh -L8888:localhost:8890 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
(replace <code><username></code> with your actual username) The tunnel is open as long as this SSH connection is alive. In this example, we tunnel Mist login node’s port 8890 (where our server is assumed to be running) to our home computer’s port 8888 (any other free port is fine). The notebook can be accessed in the browser at the <code><nowiki>http://localhost:8888</nowiki></code> address (followed by <code>/?token=54c4090d……</code>, or the token can be input on the webpage).<br />
<br />
== Using Jupyter on compute nodes ==<br />
<br />
You can use the instructions here to set up a Jupyter Notebook server on a compute node (including in a [[#Testing_and_debugging|debugjob]]). '''We strongly discourage''' you from running an interactive notebook on a compute node (other than in a debugjob), since scheduled jobs run at arbitrary times and are not meant to be interactive. Jupyter notebooks can be run non-interactively or converted to Python scripts.<br />
<br />
To launch the Jupyter Notebook server, load the <code>anaconda3</code> module and activate your environment as before (by adding the appropriate lines to the submission script, if you are not using the compute node with an interactive shell). Launching the server has to be done like so:<br />
<pre><br />
HOME=/dev/shm/$USER jupyter-notebook<br />
</pre><br />
That is because Jupyter will fail unless it can write to the home folder, which is read-only from compute nodes. This modification of the <code>$HOME</code> environment variable will carry over into the notebooks, which is usually not a problem, but in case the notebook relies on this environment variable (e.g. to read certain files), it can be reset manually in the notebook (<code>import os; os.environ['HOME']=……</code>).<br />
<br />
Because compute nodes are not accessible from the Internet, tunneling has to be done twice, once from the remote location (office or home) to the Mist login node, and then from the login node to the compute node. Assuming the server is running on port 8890 of the mist006 node, open the first tunnel in a new terminal session in the remote computer:<br />
<pre><br />
ssh -L8888:localhost:9999 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
where 9999 is any available port on the Mist login node (to test port availability enter <code>ss -Hln src :9999</code> in the terminal when connected to the Mist login node; an empty output indicates that the port is free). In the same session in the login node that was created with the above command, open the second tunnel to the compute node:<br />
<pre><br />
ssh -L9999:localhost:8890 mist006<br />
</pre><br />
Be aware that the second tunnel will automatically disconnect once the job on the compute node times out or is relinquished. The Jupyter Notebook server running on the compute node can now be accessed from the browser as in the previous subsection.<br />
<br />
<br />
= Support =<br />
<br />
SciNet inquiries:<br />
* [mailto:support@scinet.utoronto.ca support@scinet.utoronto.ca]<br />
<br />
SOSCIP inquiries:<br />
*[mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Mist&diff=3635
Mist
2022-03-09T14:53:01Z
<p>Ejspence: /* Limits */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Mist.jpg|center|300px|thumb]]<br />
|name=Mist<br />
|installed=Dec 2019<br />
|operatingsystem= Red Hat Enterprise Linux 8.2<br />
|loginnode= mist.scinet.utoronto.ca<br />
|nnodes= 54 IBM AC922<br />
|rampernode= 256 GB <br />
|gpuspernode=4 V100-SMX2-32GB<br />
|interconnect=Mellanox EDR<br />
|vendorcompilers= NVCC, IBM XL<br />
|queuetype=Slurm<br />
}}<br />
<br />
=Specifications=<br />
Mist is a SciNet-[[#SOSCIP Users |SOSCIP]] joint GPU cluster consisting of 54 IBM AC922 servers. Each node of the cluster has 32 IBM Power9 cores, 256GB RAM and 4 NVIDIA V100-SMX2-32GB GPU with NVLINKs in between. The cluster has InfiniBand EDR interconnection providing GPU-Direct RMDA capability.<br />
<br />
'''<span style="background:#fc8383">Important note:</span>''' the majority of computer systems as of 2021 (laptops, desktops, and HPC) use the 64 bit x86 instruction set architecture (ISA) in their microprocessors produced by Intel and AMD. This ISA is incompatible with Mist, whose hardware uses the 64 bit PPC ISA (set to little endian mode). The practical meaning is that x86-compiled binaries (executables and libraries) cannot be installed on Mist. For this reason, the Niagara and Compute Canada software stacks (modules) cannot be made available on Mist, and using closed-source software is only possible when the vendor provides a compatible version of their application. '''Python applications''' almost always rely on bindings to libraries originally written in C or C++, some of them are not available on PyPI or various Conda channels as precompiled binaries compatible with Mist. The recommended way to use Python on Mist is to create a [[#Anaconda (Python)|Conda]] environment and install packages from the anaconda (default) channel, where most popular packages have a linux-ppc64le (Mist-compatible) version available. Some popular machine learning packages should be installed from the internal [[#Open-CE|Open-CE]] channel. Where a compatible Conda package cannot be found, installing from PyPI (<code>pip install</code>) can be attempted. Pip will attempt to compile the package’s source code if no compatible precompiled wheel is available, therefore a compiler module (such as <code>gcc/.core</code>) should be loaded in advance. Some packages require tweaking of the source code or build procedure to successfully compile on Mist, please contact [[#Support|support]] if you need assistance.<br />
<br />
= Getting started on Mist =<br />
As of January 22 2022, authentication is only allowed via SSH keys. [https://docs.computecanada.ca/wiki/SSH_Keys Please refer to this page] to generate your SSH key pair and make sure you use them securely.<br />
<br />
Mist can be accessed directly:<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@mist.scinet.utoronto.ca<br />
</pre><br />
Mist login node '''mist-login01''' can also be accessed via Niagara cluster.<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y mist-login01<br />
</pre><br />
== Storage ==<br />
The filesystem for Mist is shared with Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Mist: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]] and a list of [[Modules for Mist]] is also available.<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH, and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
== Tips for loading software ==<br />
<br />
* We advise '''''against''''' loading modules in your .bashrc. This can lead to very confusing behaviour under certain circumstances. Our guidelines for .bashrc files can be found [[bashrc guidelines|here]].<br />
* Instead, load modules by hand when needed, or by sourcing a separate script.<br />
* Load run-specific modules inside your job submission script.<br />
* Short names give default versions; e.g. <code>cuda</code> → <code>cuda/11.0.3</code>. It is usually better to be explicit about the versions, for future reproducibility.<br />
* Modules often require other modules to be loaded first. Solve these dependencies by using [[Using_modules#Module_spider | <code>module spider</code>]].<br />
<br />
= Available compilers and interpreters =<br />
* <tt>cuda</tt> module has to be loaded first for GPU software.<br />
* For most compiled software, one should use the GNU compilers (<tt>gcc</tt> for C, <tt>g++</tt> for C++, and <tt>gfortran</tt> for Fortran). Loading <tt>gcc</tt> module makes these available. <br />
* The IBM XL compiler suite (<tt>xlc_r, xlc++_r, xlf_r</tt>) is also available, if you load one of the <tt>xl</tt> modules.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> or <tt>spectrum-mpi</tt> module.<br />
<br />
=== CUDA ===<br />
<br />
The current installed CUDA Tookits are '''11.0.3''' and '''10.2.2 (10.2.89)'''<br />
<pre><br />
module load cuda/11.0.3<br />
module load cuda/10.2.2<br />
</pre><br />
*A compiler (GCC, XL or NVHPC/PGI) module must be loaded in order to use CUDA to build any code.<br />
The current NVIDIA driver version is 450.119.04.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/9.3.0 (must load CUDA 11)<br />
gcc/8.5.0 (must load CUDA 10 or 11)<br />
gcc/10.3.0 (w/o CUDA)<br />
</pre><br />
<br />
=== IBM XL Compilers ===<br />
<br />
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run<br />
<br />
<pre><br />
module load xl/16.1.1.10<br />
</pre><br />
<br />
IBM XL Compilers are enabled for use with NVIDIA GPUs, including support for OpenMP GPU offloading and integration with NVIDIA's nvcc command to compile host-side code for the POWER9 CPU. Information about the IBM XL Compilers can be found at the following links:[https://www.ibm.com/support/knowledgecenter/SSXVZZ_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL C/C++], <br />
[https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]<br />
<br />
=== OpenMPI ===<br />
<tt>openmpi/<version></tt> module is avaiable with different compilers including GCC and XL. <tt>spectrum-mpi/<version></tt> module provides IBM Spectrum MPI.<br />
<br />
=== NVHPC/PGI ===<br />
PGI compiler is provided in NVHPC (NVIDIA HPC SDK).<br />
<pre><br />
module load nvhpc/21.3<br />
</pre><br />
<br />
= Software =<br />
== Amber20 ==<br />
<br />
Users who hold Amber20 license can build Amber20 from its source code and run on Mist. '''SOSCIP/SciNet doesn't provide Amber license or source code.'''<br />
<br />
=== Building Amber20 ===<br />
Modules that are needed for building Amber20:<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05 cmake/3.19.8<br />
</pre><br />
Cmake configuration:<br />
<pre><br />
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/where-amber-install -DCOMPILER=GNU -DMPI=FALSE -DCUDA=TRUE -DINSTALL_TESTS=TRUE -DDOWNLOAD_MINICONDA=FALSE -DOPENMP=TRUE -DNCCL=FALSE -DAPPLY_UPDATES=TRUE<br />
</pre><br />
<br />
=== Running Amber20 ===<br />
'''NVIDIA Pascal P100 and later GPUs like V100 do not scale beyond a single GPU'''. It is highly suggested to run Amber20 as a single-gpu job.<br />
A job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP-project-ID><br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05<br />
export PATH=$HOME/where-amber-install/bin:$PATH<br />
export LD_LIBRARY_PATH=$HOME/where-amber-install/lib:$LD_LIBRARY_PATH<br />
pmemd.cuda .... <parameters> ...<br />
</pre><br />
<br />
== Anaconda (Python) ==<br />
Anaconda is a popular distribution of the Python programming language. It contains several common Python libraries such as SciPy and NumPy as pre-built packages, which eases installation. Anaconda is provided as modules: '''anaconda3'''<br />
<br />
To install Anaconda locally, user need to load the module and create a conda environment:<br />
<pre><br />
module load anaconda3<br />
conda create -n myPythonEnv python=3.8<br />
</pre><br />
*Note: By default, conda environments are located in '''$HOME/.conda/envs'''. Cache (downloaded tarballs and packages) is under '''$HOME/.conda/pkgs'''. User may run into problem with disk quota if there are too many environments created. To clean conda cache, '''please run: "conda clean -y --all" and "rm -rf $HOME/.conda/pkgs/*" after installation of packages'''.<br />
<br />
To activate the conda environment: (should be activated before running python)<br />
<pre><br />
source activate myPythonEnv<br />
</pre><br />
Note that you SHOULD NOT use '''conda activate myPythonEnv''' to activate the environment. This leads to all sorts of problems. Once the environment is activated, user can update or install packages via '''conda''' or '''pip'''<br />
<pre><br />
conda install <package_name> (preferred way to install packages)<br />
pip install <package_name><br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
To deactivate:<br />
<pre><br />
source deactivate<br />
</pre><br />
To remove a conda environment:<br />
<pre><br />
conda remove --name myPythonEnv --all<br />
</pre><br />
To verify that the environment was removed, run:<br />
<pre><br />
conda info --envs<br />
</pre><br />
<br />
=== Submitting Python Job ===<br />
A single-gpu job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate myPythonEnv<br />
python code.py ...<br />
</pre><br />
<br />
== CuPy ==<br />
[https://cupy.chainer.org CuPy] is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRand, cuSolver, cuSPARSE, cuFFT and NCCL to make full use of the GPU architecture. CuPy is an implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it. It supports a subset of numpy.ndarray interface.<br />
<br />
CuPy can be install into any conda environment. Python packages: numpy, six and fastrlock are required. cuDNN and NCCL are optional.<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 nccl/2.9.9 anaconda3/2021.05<br />
conda create -n cupy-env python=3.8 numpy six fastrlock<br />
source activate cupy-env<br />
CFLAGS="-I$MODULE_CUDNN_PREFIX/include -I$MODULE_NCCL_PREFIX/include -I$MODULE_CUDA_PREFIX/include" LDFLAGS="-L$MODULE_CUDNN_PREFIX/lib64 -L$MODULE_NCCL_PREFIX/lib" CUDA_PATH=$MODULE_CUDA_PREFIX pip install cupy<br />
#building/installing CuPy will take a few minutes<br />
</pre><br />
<br />
== Gromacs ==<br />
[http://www.gromacs.org/ GROMACS] is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.<br />
*'''GROMACS 2019'''<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
</pre><br />
*'''GROMACS 2020 and 2021''' Thread-MPI version supports full GPU enablement of all key computational sections. The GPU is used throughout the timestep and repeated CPU-GPU transfers are eliminated. Users are suggested to carefully verify the results.<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.4<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.6<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.2<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 openmpi/4.1.1+ucx-1.10.0 gromacs/2021.2 (testing purpose only)<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.4<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
</pre><br />
=== Small/Medium Simulation ===<br />
Due to the lack of PME domain decomposition support on GPU, Gromacs uses CPU to calculate PME when using multiple GPUs. '''It is always recommended to use a single GPU to do small and medium sized simulations with Gromacs.''' By using only 1 tMPI thread (w/ multiple OpenMP threads) on a single GPU, both non-bonded PP and PME are atomically offloaded to GPU when possible.<br />
* Gromacs 2019 example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
* Gromacs 2020 or 2021 example: <br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
export GMX_FORCE_UPDATE_DEFAULT_GPU=true<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
=== Large Simulation ===<br />
If the memory available to a single-GPU job (~58 GB) is not sufficient for the simulation, multiple GPUs can be used. It is suggested to start testing with one full node (4 GPUs) and to force PME onto the GPU. Multiple PME ranks are not supported with PME on the GPU, so if the GPU is used for the PME calculation, -npme (the number of PME ranks) must be set to 1. If PME has less work than PP, it is suggested to run multiple ranks per GPU, so that the GPU handling the PME rank can also do some PP work.<br />
'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not waste 3 additional GPUs for only a small performance improvement.'''<br />
*An example using 4 GPUs, with 7 PP ranks/tMPI threads + 1 PME rank/tMPI thread ('''-pin on -pme gpu -npme 1''' must be added to the mdrun command in order to force the GPU to do PME):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on -pme gpu -npme 1 ... <add your parameters><br />
</pre><br />
*It is suggested to also test '''-ntmpi 4''' with '''export OMP_NUM_THREADS=8''' if the GROMACS output contains a NOTE saying "% performance was lost because the PME ranks had more work to do than the PP ranks". In this case, NVIDIA MPS is not needed since there is only one MPI rank per GPU.<br />
*'''Please note that PME on GPU is still an initial implementation of this capability, and comes with the set of limitations listed below.'''<br />
<pre><br />
* Only a PME order of 4 is supported on GPUs.<br />
* PME will run on a GPU only when exactly one rank has a PME task, i.e. decompositions with multiple ranks doing PME are not supported.<br />
* Only single precision is supported.<br />
* Free energy calculations where charges are perturbed are not supported, because only single PME grids can be calculated.<br />
* Only dynamical integrators are supported (i.e. leap-frog, velocity Verlet, stochastic dynamics).<br />
* LJ PME is not supported on GPUs.<br />
</pre><br />
*An example using 4 GPUs, with '''PME on the CPU''' ('''-pin on''' must be added to the mdrun command for proper CPU thread binding):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on ... <add your parameters><br />
<br />
# "-ntmpi 16, OMP_NUM_THREADS=2" and "-ntmpi 4, OMP_NUM_THREADS=8" should also be tested. <br />
# num_thread_MPI_ranks(-ntmpi) * num_OpenMP_threads = 32<br />
</pre><br />
*'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not waste 3 additional GPUs for only a small performance improvement.'''<br />
*'''NOTE: The above examples will NOT work across multiple nodes. If your simulation is too large for a single GPU node, please contact SciNet/SOSCIP support.'''<br />
<br />
== NAMD ==<br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.<br />
=== 2.14 ===<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
</pre><br />
==== Running with single GPU ====<br />
If you have many jobs to run, it is always suggested to use a single GPU per job. This makes jobs easier to schedule and gives better overall performance.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --nodes=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -bind-to none -hostfile nodelist-$SLURM_JOB_ID `which namd2` +idlepoll +ppn 8 +p 8 stmv.namd<br />
</pre><br />
<br />
==== Running with one process per node (4 GPUs)====<br />
An example of the job script (using 1 node, '''one process per node''', 32 CPU threads per process + 4 GPUs per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 32 +p $((32*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
==== Running with one process per GPU (4 GPUs)====<br />
NAMD may scale better when using '''one process per GPU'''. Please run your own benchmarks.<br />
An example of the job script (using 1 node, '''one process per GPU''', 8 CPU threads per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 4 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 8 +p $((8*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
<br />
== Open-CE ==<br />
[https://github.com/open-ce/open-ce Open-CE] is an '''IBM''' repository of feedstocks, environment data, and scripts for building TensorFlow, PyTorch, and other machine learning packages and their dependencies. Open-CE is distributed as a '''conda channel''' on the Mist cluster.<br />
'''Available packages and versions are listed at [https://github.com/open-ce/open-ce/releases/tag/open-ce-v1.5.2 Open-CE Releases].''' Currently only Python 3.8 and CUDA 11.2 are supported. If you need a different Python or CUDA version, please contact SOSCIP or SciNet support.<br />
<br />
*Packages can be installed by pointing conda at the Open-CE channel:<br />
<pre><br />
conda install -c /scinet/mist/ibm/open-ce python=3.8 cudatoolkit=11.2 PACKAGE<br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
{| class="wikitable"<br />
|+Available Packages:<br />
|-<br />
|Tensorflow<br />
|TensorFlow Estimators<br />
|TensorFlow Probability<br />
|TensorBoard<br />
|TensorBoard Data Server<br />
|TensorFlow Text<br />
|TensorFlow Model Optimizations<br />
|TensorFlow Addons<br />
|TensorFlow Datasets<br />
|TensorFlow Hub<br />
|-<br />
|TensorFlow MetaData<br />
|PyTorch<br />
|TorchText<br />
|TorchVision<br />
|PyTorch Lightning<br />
|PyTorch Lightning Bolts<br />
|ONNX<br />
|Onnx-runtime<br />
|skl2onnx<br />
|tf2onnx<br />
|-<br />
|onnxmltools<br />
|onnxconverter-common<br />
|XGBoost<br />
|LightGBM<br />
|Transformers<br />
|Tokenizers<br />
|SentencePiece<br />
|Spacy<br />
|DALI<br />
|OpenCV<br />
|-<br />
|Horovod<br />
|PyArrow<br />
|grpc<br />
|uwsgi<br />
|ORC<br />
|Mamba<br />
|}<br />
<br />
== PyTorch ==<br />
=== Installing from IBM Open-CE Conda Channel ===<br />
The easiest way to install PyTorch on Mist is from IBM's Open-CE Conda channel. Users need to prepare a conda environment and install PyTorch from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n pytorch_env python=3.8<br />
source activate pytorch_env<br />
conda install -c /scinet/mist/ibm/open-ce pytorch=1.10.2 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 pytorch=1.7.1 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
Add the command below to your job script, before the python command, to get deterministic results; see details here: [https://github.com/pytorch/pytorch/issues/39849]<br />
<pre><br />
export CUBLAS_WORKSPACE_CONFIG=:4096:2<br />
</pre><br />
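For completeness, here is a minimal sketch (an illustration, not taken from the page linked above) of the PyTorch side of deterministic execution; it assumes CUBLAS_WORKSPACE_CONFIG has already been exported in the job script as shown:<br />
<pre><br />
# minimal PyTorch determinism sketch (illustrative only)<br />
import torch<br />
<br />
torch.manual_seed(0)<br />
torch.use_deterministic_algorithms(True)  # errors out if a non-deterministic op is used<br />
<br />
device = "cuda" if torch.cuda.is_available() else "cpu"<br />
x = torch.randn(64, 64, device=device)<br />
print(torch.matmul(x, x).sum().item())<br />
</pre><br />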
<br />
== RAPIDS ==<br />
[https://rapids.ai RAPIDS] is a suite of open-source software libraries for executing end-to-end data science and analytics pipelines entirely on GPUs. The RAPIDS data science framework includes a collection of libraries: '''cuDF''' (GPU DataFrames), '''cuML''' (GPU machine learning algorithms), '''cuStrings''' (GPU string manipulation), etc.<br />
<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install RAPIDS on Mist is from IBM's Conda channel. Users need to prepare a conda environment with Python 3.6 or 3.7 and install powerai-rapids from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n rapids_env python=3.7<br />
source activate rapids_env<br />
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/ powerai-rapids<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
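Once installed, a quick check can be run on a GPU node. The snippet below is only a minimal sketch of cuDF's pandas-like interface, not part of the official instructions:<br />
<pre><br />
# minimal cuDF sketch (illustrative only)<br />
import cudf<br />
<br />
df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})<br />
print(df["b"].mean())          # computed on the GPU<br />
print(df.groupby("a").sum())   # pandas-style groupby, GPU-accelerated<br />
</pre><br />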
<br />
== TensorFlow and Keras ==<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install TensorFlow and Keras on Mist is from IBM's Open-CE Conda channel. Users need to prepare a conda environment and install TensorFlow from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n tf_env python=3.8<br />
source activate tf_env<br />
conda install -c /scinet/mist/ibm/open-ce tensorflow==2.7.1 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 tensorflow==2.4.3 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
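A short test on a GPU node can confirm that the installed TensorFlow sees the GPUs; the snippet below is only a minimal sketch, not part of the official instructions:<br />
<pre><br />
# minimal TensorFlow sketch (illustrative only)<br />
import tensorflow as tf<br />
<br />
print("TensorFlow", tf.__version__)<br />
print("GPUs visible:", tf.config.list_physical_devices("GPU"))<br />
<br />
# a tiny computation that will be placed on a GPU if one is available<br />
x = tf.random.normal((1024, 1024))<br />
print(tf.reduce_sum(tf.matmul(x, x)))<br />
</pre><br />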
<br />
= Testing and debugging =<br />
You should test your code before submitting it to the cluster, both to verify that it is correct and to determine what resources it needs.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, use at most about 1-2 GB of memory, and use no more than one GPU and a few cores.<br />
<!-- * You can run the [[Parallel Debugging with DDT|DDT]] debugger on the login nodes after <code>module load ddt</code>. --><br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob command:<br />
mist-login01:~$ debugjob --clean -g G<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours; G=4 gets you a single node with 4 GPUs for 30 minutes; and G=8 (the maximum) gets you 2 nodes, each with 4 GPUs, for 15 minutes. The <tt>--clean</tt> argument is optional but recommended, as it starts the session without any modules loaded, thus mimicking more closely what happens when you submit a job script. Users need to load modules and activate their conda environment after the debug job starts. If the --clean flag was omitted, it is recommended to run 'conda clean' before 'source activate ENV' in the debug job.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Mist login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on some of Mist's 53 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Mist uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (4 GPUs per node)'''.<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
mist-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by single GPU or by full node, so you can request only 1 GPU or 4 GPUs per node.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below). <br />
== SOSCIP Users ==<br />
*[https://www.soscip.org SOSCIP] is a consortium that brings together industrial partners and academic researchers and provides them with sophisticated advanced-computing technologies and expertise to solve social, technical and business challenges across sectors and drive economic growth.<br />
<br />
If you are working on a SOSCIP project, please contact [mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca] to have your user account added to the SOSCIP project accounts. SOSCIP users need to submit jobs with an additional SLURM flag to get higher priority:<br />
<pre><br />
#SBATCH -A soscip-<SOSCIP_PROJECT_ID> #e.g. soscip-3-001<br />
OR<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID><br />
</pre><br />
<br />
== Single-GPU job script ==<br />
A single-GPU job gets a quarter of a node: 1 GPU, 8 CPU cores (32 hardware threads) and ~58 GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
*It is suggested to use NVIDIA Multi-Process Service (MPS) if running multiple MPI ranks on one GPU.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate conda_env<br />
python code.py ...<br />
</pre><br />
<br />
== Full-node job script ==<br />
'''If you are not sure whether your program can run on multiple GPUs, please follow the single-GPU job instructions above or contact SciNet/SOSCIP support.'''<br />
<br />
A multi-GPU job must request at least one full node (4 GPUs). Users need to specify the "compute_full_node" partition in order to get all resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4 #this only affects MPI job<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued.<br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Running jobs<br />
!Submitted jobs (incl. running)<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 GPU (8&nbsp;cores) || default:&nbsp;4&nbsp;nodes&nbsp;(16&nbsp;GPUs) <br> with&nbsp;allocation:&nbsp;4&nbsp;nodes&nbsp;(16&nbsp;GPUs)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 GPU (8&nbsp;cores) || 2 nodes (8&nbsp;GPUs)|| N/A || 1 hour<br />
|-<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
= Jupyter Notebooks =<br />
SciNet’s [[Jupyter Hub]] is a Niagara-type node; it has a different CPU architecture and no GPUs, so Conda environments prepared on Mist will not work properly there. Users who need a Jupyter Notebook to develop and test some aspects of their workflow can create their own server on the Mist login node and use an SSH tunnel to connect to it from outside. Users who choose to do so have to keep in mind that the login node is a shared resource, and heavy calculations should be done only on compute nodes. Processes (including the iPython kernels used by the notebooks) are limited to one hour of total CPU time: idle time does not count toward this hour, and use of multiple cores counts proportionally to the number of cores (i.e. a kernel using all 128 virtual cores of the node will be killed after about 28 seconds). Idle notebooks can still burden the node by hogging system and GPU memory; please be mindful of other users and terminate notebooks when your work is done.<br />
<br />
As an example, let us create a new Conda environment and activate it:<br />
<pre><br />
module load anaconda3<br />
conda create -n jupyter_env python=3.7<br />
source activate jupyter_env<br />
</pre><br />
Install the Jupyter Notebook server:<br />
<pre><br />
conda install notebook<br />
</pre><br />
<br />
== Running the notebook server ==<br />
When the Conda environment is active, enter:<br />
<pre><br />
jupyter-notebook<br />
</pre><br />
By default, the Jupyter Notebook server uses port 8888 (can be overridden with the <code>--port</code> option). If another user has already started their own server, the default port may be busy, in which case the server will be listening on a different port. Once launched, the server will output some information to the terminal that will include the actual port number used and a 48-character token. For example:<br />
<pre>http://localhost:8890/?token=54c4090d……</pre><br />
In this example, the server is listening on port 8890.<br />
<br />
== Creating a tunnel ==<br />
In order to access this port remotely (i.e. from your office or home), an [https://en.wikipedia.org/wiki/Tunneling_protocol#Secure_Shell_tunneling SSH tunnel] has to be established. Please refer to your SSH client’s documentation for instructions on how to do that. For the OpenSSH client (standard in most Linux distributions and macOS), a tunnel can be opened in a terminal session separate from the one where the Jupyter Notebook server is running. In the new terminal, issue this command:<br />
<pre><br />
ssh -L8888:localhost:8890 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
(replace <code><username></code> with your actual username). The tunnel stays open as long as this SSH connection is alive. In this example, we tunnel the Mist login node’s port 8890 (where our server is assumed to be running) to our home computer’s port 8888 (any other free port is fine). The notebook can then be accessed in the browser at <code><nowiki>http://localhost:8888</nowiki></code> (followed by <code>/?token=54c4090d……</code>, or the token can be entered on the web page).<br />
<br />
== Using Jupyter on compute nodes ==<br />
<br />
You can use the instructions here to set up a Jupyter Notebook server on a compute node (including in a [[#Testing_and_debugging|debugjob]]). '''We strongly discourage''' you from running an interactive notebook on a compute node (other than in a debugjob): scheduled jobs run at arbitrary times and are not meant to be interactive. Jupyter notebooks can be run non-interactively or converted to Python scripts.<br />
<br />
To launch the Jupyter Notebook server, load the <code>anaconda3</code> module and activate your environment as before (by adding the appropriate lines to the submission script, if you are not using the compute node with an interactive shell). Launching the server has to be done like so:<br />
<pre><br />
HOME=/dev/shm/$USER jupyter-notebook<br />
</pre><br />
That is because Jupyter will fail unless it can write to the home folder, which is read-only on compute nodes. This modification of the <code>$HOME</code> environment variable will carry over into the notebooks, which is usually not a problem; but if a notebook relies on this environment variable (e.g. to read certain files), it can be reset manually in the notebook (<code>import os; os.environ['HOME']=……</code>).<br />
<br />
Because compute nodes are not accessible from the Internet, tunneling has to be done twice, once from the remote location (office or home) to the Mist login node, and then from the login node to the compute node. Assuming the server is running on port 8890 of the mist006 node, open the first tunnel in a new terminal session in the remote computer:<br />
<pre><br />
ssh -L8888:localhost:9999 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
where 9999 is any available port on the Mist login node (to test port availability, enter <code>ss -Hln src :9999</code> in a terminal on the Mist login node; empty output indicates that the port is free). In the same session on the login node that was opened by the above command, open the second tunnel to the compute node:<br />
<pre><br />
ssh -L9999:localhost:8890 mist006<br />
</pre><br />
Be aware that the second tunnel will automatically disconnect once the job on the compute node times out or is relinquished. The Jupyter Notebook server running on the compute node can now be accessed from the browser as in the previous subsection.<br />
<br />
<br />
= Support =<br />
<br />
SciNet inquiries:<br />
* [mailto:support@scinet.utoronto.ca support@scinet.utoronto.ca]<br />
<br />
SOSCIP inquiries:<br />
*[mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Mist&diff=3632
Mist
2022-03-09T14:43:27Z
<p>Ejspence: /* Limits */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Mist.jpg|center|300px|thumb]]<br />
|name=Mist<br />
|installed=Dec 2019<br />
|operatingsystem= Red Hat Enterprise Linux 8.2<br />
|loginnode= mist.scinet.utoronto.ca<br />
|nnodes= 54 IBM AC922<br />
|rampernode= 256 GB <br />
|gpuspernode=4 V100-SMX2-32GB<br />
|interconnect=Mellanox EDR<br />
|vendorcompilers= NVCC, IBM XL<br />
|queuetype=Slurm<br />
}}<br />
<br />
=Specifications=<br />
Mist is a SciNet-[[#SOSCIP Users |SOSCIP]] joint GPU cluster consisting of 54 IBM AC922 servers. Each node of the cluster has 32 IBM Power9 cores, 256GB RAM and 4 NVIDIA V100-SMX2-32GB GPU with NVLINKs in between. The cluster has InfiniBand EDR interconnection providing GPU-Direct RMDA capability.<br />
<br />
'''<span style="background:#fc8383">Important note:</span>''' the majority of computer systems as of 2021 (laptops, desktops, and HPC) use the 64 bit x86 instruction set architecture (ISA) in their microprocessors produced by Intel and AMD. This ISA is incompatible with Mist, whose hardware uses the 64 bit PPC ISA (set to little endian mode). The practical meaning is that x86-compiled binaries (executables and libraries) cannot be installed on Mist. For this reason, the Niagara and Compute Canada software stacks (modules) cannot be made available on Mist, and using closed-source software is only possible when the vendor provides a compatible version of their application. '''Python applications''' almost always rely on bindings to libraries originally written in C or C++, some of them are not available on PyPI or various Conda channels as precompiled binaries compatible with Mist. The recommended way to use Python on Mist is to create a [[#Anaconda (Python)|Conda]] environment and install packages from the anaconda (default) channel, where most popular packages have a linux-ppc64le (Mist-compatible) version available. Some popular machine learning packages should be installed from the internal [[#Open-CE|Open-CE]] channel. Where a compatible Conda package cannot be found, installing from PyPI (<code>pip install</code>) can be attempted. Pip will attempt to compile the package’s source code if no compatible precompiled wheel is available, therefore a compiler module (such as <code>gcc/.core</code>) should be loaded in advance. Some packages require tweaking of the source code or build procedure to successfully compile on Mist, please contact [[#Support|support]] if you need assistance.<br />
<br />
= Getting started on Mist =<br />
As of January 22 2022, authentication is only allowed via SSH keys. [https://docs.computecanada.ca/wiki/SSH_Keys Please refer to this page] to generate your SSH key pair and make sure you use them securely.<br />
<br />
Mist can be accessed directly:<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@mist.scinet.utoronto.ca<br />
</pre><br />
Mist login node '''mist-login01''' can also be accessed via Niagara cluster.<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y mist-login01<br />
</pre><br />
== Storage ==<br />
The filesystem for Mist is shared with Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Mist: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]] and a list of [[Modules for Mist]] is also available.<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH, and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
== Tips for loading software ==<br />
<br />
* We advise '''''against''''' loading modules in your .bashrc. This can lead to very confusing behaviour under certain circumstances. Our guidelines for .bashrc files can be found [[bashrc guidelines|here]].<br />
* Instead, load modules by hand when needed, or by sourcing a separate script.<br />
* Load run-specific modules inside your job submission script.<br />
* Short names give default versions; e.g. <code>cuda</code> → <code>cuda/11.0.3</code>. It is usually better to be explicit about the versions, for future reproducibility.<br />
* Modules often require other modules to be loaded first. Solve these dependencies by using [[Using_modules#Module_spider | <code>module spider</code>]].<br />
<br />
= Available compilers and interpreters =<br />
* <tt>cuda</tt> module has to be loaded first for GPU software.<br />
* For most compiled software, one should use the GNU compilers (<tt>gcc</tt> for C, <tt>g++</tt> for C++, and <tt>gfortran</tt> for Fortran). Loading <tt>gcc</tt> module makes these available. <br />
* The IBM XL compiler suite (<tt>xlc_r, xlc++_r, xlf_r</tt>) is also available, if you load one of the <tt>xl</tt> modules.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> or <tt>spectrum-mpi</tt> module.<br />
<br />
=== CUDA ===<br />
<br />
The current installed CUDA Tookits are '''11.0.3''' and '''10.2.2 (10.2.89)'''<br />
<pre><br />
module load cuda/11.0.3<br />
module load cuda/10.2.2<br />
</pre><br />
*A compiler (GCC, XL or NVHPC/PGI) module must be loaded in order to use CUDA to build any code.<br />
The current NVIDIA driver version is 450.119.04.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/9.3.0 (must load CUDA 11)<br />
gcc/8.5.0 (must load CUDA 10 or 11)<br />
gcc/10.3.0 (w/o CUDA)<br />
</pre><br />
<br />
=== IBM XL Compilers ===<br />
<br />
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run<br />
<br />
<pre><br />
module load xl/16.1.1.10<br />
</pre><br />
<br />
IBM XL Compilers are enabled for use with NVIDIA GPUs, including support for OpenMP GPU offloading and integration with NVIDIA's nvcc command to compile host-side code for the POWER9 CPU. Information about the IBM XL Compilers can be found at the following links:[https://www.ibm.com/support/knowledgecenter/SSXVZZ_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL C/C++], <br />
[https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]<br />
<br />
=== OpenMPI ===<br />
<tt>openmpi/<version></tt> module is avaiable with different compilers including GCC and XL. <tt>spectrum-mpi/<version></tt> module provides IBM Spectrum MPI.<br />
<br />
=== NVHPC/PGI ===<br />
PGI compiler is provided in NVHPC (NVIDIA HPC SDK).<br />
<pre><br />
module load nvhpc/21.3<br />
</pre><br />
<br />
= Software =<br />
== Amber20 ==<br />
<br />
Users who hold Amber20 license can build Amber20 from its source code and run on Mist. '''SOSCIP/SciNet doesn't provide Amber license or source code.'''<br />
<br />
=== Building Amber20 ===<br />
Modules that are needed for building Amber20:<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05 cmake/3.19.8<br />
</pre><br />
Cmake configuration:<br />
<pre><br />
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/where-amber-install -DCOMPILER=GNU -DMPI=FALSE -DCUDA=TRUE -DINSTALL_TESTS=TRUE -DDOWNLOAD_MINICONDA=FALSE -DOPENMP=TRUE -DNCCL=FALSE -DAPPLY_UPDATES=TRUE<br />
</pre><br />
<br />
=== Running Amber20 ===<br />
'''NVIDIA Pascal P100 and later GPUs like V100 do not scale beyond a single GPU'''. It is highly suggested to run Amber20 as a single-gpu job.<br />
A job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP-project-ID><br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05<br />
export PATH=$HOME/where-amber-install/bin:$PATH<br />
export LD_LIBRARY_PATH=$HOME/where-amber-install/lib:$LD_LIBRARY_PATH<br />
pmemd.cuda .... <parameters> ...<br />
</pre><br />
<br />
== Anaconda (Python) ==<br />
Anaconda is a popular distribution of the Python programming language. It contains several common Python libraries such as SciPy and NumPy as pre-built packages, which eases installation. Anaconda is provided as modules: '''anaconda3'''<br />
<br />
To install Anaconda locally, user need to load the module and create a conda environment:<br />
<pre><br />
module load anaconda3<br />
conda create -n myPythonEnv python=3.8<br />
</pre><br />
*Note: By default, conda environments are located in '''$HOME/.conda/envs'''. Cache (downloaded tarballs and packages) is under '''$HOME/.conda/pkgs'''. User may run into problem with disk quota if there are too many environments created. To clean conda cache, '''please run: "conda clean -y --all" and "rm -rf $HOME/.conda/pkgs/*" after installation of packages'''.<br />
<br />
To activate the conda environment: (should be activated before running python)<br />
<pre><br />
source activate myPythonEnv<br />
</pre><br />
Note that you SHOULD NOT use '''conda activate myPythonEnv''' to activate the environment. This leads to all sorts of problems. Once the environment is activated, user can update or install packages via '''conda''' or '''pip'''<br />
<pre><br />
conda install <package_name> (preferred way to install packages)<br />
pip install <package_name><br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
To deactivate:<br />
<pre><br />
source deactivate<br />
</pre><br />
To remove a conda environment:<br />
<pre><br />
conda remove --name myPythonEnv --all<br />
</pre><br />
To verify that the environment was removed, run:<br />
<pre><br />
conda info --envs<br />
</pre><br />
<br />
=== Submitting Python Job ===<br />
A single-gpu job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate myPythonEnv<br />
python code.py ...<br />
</pre><br />
<br />
== CuPy ==<br />
[https://cupy.chainer.org CuPy] is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRand, cuSolver, cuSPARSE, cuFFT and NCCL to make full use of the GPU architecture. CuPy is an implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it. It supports a subset of numpy.ndarray interface.<br />
<br />
CuPy can be install into any conda environment. Python packages: numpy, six and fastrlock are required. cuDNN and NCCL are optional.<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 nccl/2.9.9 anaconda3/2021.05<br />
conda create -n cupy-env python=3.8 numpy six fastrlock<br />
source activate cupy-env<br />
CFLAGS="-I$MODULE_CUDNN_PREFIX/include -I$MODULE_NCCL_PREFIX/include -I$MODULE_CUDA_PREFIX/include" LDFLAGS="-L$MODULE_CUDNN_PREFIX/lib64 -L$MODULE_NCCL_PREFIX/lib" CUDA_PATH=$MODULE_CUDA_PREFIX pip install cupy<br />
#building/installing CuPy will take a few minutes<br />
</pre><br />
<br />
== Gromacs ==<br />
[http://www.gromacs.org/ GROMACS] is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.<br />
*'''GROMACS 2019'''<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
</pre><br />
*'''GROMACS 2020 and 2021''' Thread-MPI version supports full GPU enablement of all key computational sections. The GPU is used throughout the timestep and repeated CPU-GPU transfers are eliminated. Users are suggested to carefully verify the results.<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.4<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.6<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.2<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 openmpi/4.1.1+ucx-1.10.0 gromacs/2021.2 (testing purpose only)<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.4<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
</pre><br />
=== Small/Medium Simulation ===<br />
Due to the lack of PME domain decomposition support on GPU, Gromacs uses CPU to calculate PME when using multiple GPUs. '''It is always recommended to use a single GPU to do small and medium sized simulations with Gromacs.''' By using only 1 tMPI thread (w/ multiple OpenMP threads) on a single GPU, both non-bonded PP and PME are atomically offloaded to GPU when possible.<br />
* Gromacs 2019 example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
* Gromacs 2020 or 2021 example: <br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
export GMX_FORCE_UPDATE_DEFAULT_GPU=true<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
=== Large Simulation ===<br />
If memory size (~58GB) for single-gpu job is not sufficient for the simulation, multiple GPUs can be used. It is suggested to test starting with one full node with 4GPUs and force PME on GPU. Multiple PME ranks are not supported with PME on GPU, so if GPU is used for the PME calculation -npme (number of PME ranks) must be set to 1. If PME has less work than PP, it is suggested to run multiple ranks per GPU, so the GPU for PME rank can also do some work on PP rank(s).<br />
'''If your simulation can fit in a single GPU job, please use single GPU to get much higher efficiency. Do not waste 3 additional GPU resource for getting only a small performance improvement.<br />
'''<br />
*An example using 4 GPUs, 7 PP ranks/tmpi threads + 1 PME rank/tmpi thread: ('''-pin on -pme gpu -npme 1''' must be added to mdrun command in order to force GPU to do PME)<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on -pme gpu -npme 1 ... <add your parameters><br />
</pre><br />
*It is suggested to also test using '''-ntmpi 4''' and '''export OMP_NUM_THREADS=8''' if you receive a NOTE in Gromacs output saying "% performance was lost because the PME ranks had more work to do than the PP ranks". In this case, NVIDIA MPS is not needed since there is only one MPI rank per GPU.<br />
*'''Please note that the solving of PME on GPU is still only the initial version supporting this behaviour, and comes with a set of limitations outlined further below.'''<br />
<pre><br />
* Only a PME order of 4 is supported on GPUs.<br />
* PME will run on a GPU only when exactly one rank has a PME task, ie. decompositions with multiple ranks doing PME are not supported.<br />
* Only single precision is supported.<br />
* Free energy calculations where charges are perturbed are not supported, because only single PME grids can be calculated.<br />
* Only dynamical integrators are supported (ie. leap-frog, Velocity Verlet, stochastic dynamics)<br />
* LJ PME is not supported on GPUs.<br />
</pre><br />
*An example using 4 GPUs, '''PME on CPU''': ('''-pin on''' must be added to mdrun command for proper CPU thread bindings)<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on ... <add your parameters><br />
<br />
# "-ntmpi 16, OMP_NUM_THREADS=2" and "-ntmpi 4, OMP_NUM_THREADS=8" should also be tested. <br />
# num_thread_MPI_ranks(-ntmpi) * num_OpenMP_threads = 32<br />
</pre><br />
*'''If your simulation can fit in a single GPU job, please use single GPU to get much higher efficiency. Do not waste 3 additional GPU resource for getting only a small performance improvement.'''<br />
*'''NOTE: The above examples will NOT work with multiple nodes. If simulation is too large for a single GPU node, please contact SciNet/SOSCIP support.'''<br />
<br />
== NAMD ==<br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.<br />
=== 2.14 ===<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
</pre><br />
==== Running with single GPU ====<br />
If you have many jobs to run, it is always suggested to run with a single gpu per job. This makes jobs easier to be scheduled and gives better overall performance.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --nodes=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -bind-to none -hostfile nodelist-$SLURM_JOB_ID `which namd2` +idlepoll +ppn 8 +p 8 stmv.namd<br />
</pre><br />
<br />
==== Running with one process per node (4 GPUs)====<br />
An example of the job script (using 1 node, '''one process per node''', 32 CPU threads per process + 4 GPUs per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 32 +p $((32*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
==== Running with one process per GPU (4 GPUs)====<br />
NAMD may scale better if using '''one process per GPU'''. Please do your own benchmark.<br />
An example of the job script (using 1 node, '''one process per GPU''', 8 CPU threads per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 4 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 8 +p $((8*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
<br />
== Open-CE ==<br />
[https://github.com/open-ce/open-ce Open-CE] is an '''IBM''' repo for feedstock collection, environment data, and scripts for building Tensorflow, Pytorch, and other machine learning packages and dependencies. Open-CE is distributed as a '''conda channel''' on Mist cluster.<br />
'''Available packages and versions are listed here [https://github.com/open-ce/open-ce/releases/tag/open-ce-v1.5.2 Open-CE Releases]'''. Currently only python 3.8 and CUDA 11.2 are supported. If you need a different python or cuda version, please contact SOSCIP or SciNet support.<br />
<br />
*Packages can be installed by setting Open-CE conda channel:<br />
<pre><br />
conda install -c /scinet/mist/ibm/open-ce python=3.8 cudatoolkit=11.2 PACKAGE<br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
{| class="wikitable"<br />
|+Available Packages:<br />
|-<br />
|Tensorflow<br />
|TensorFlow Estimators<br />
|TensorFlow Probability<br />
|TensorBoard<br />
|TensorBoard Data Server<br />
|TensorFlow Text<br />
|TensorFlow Model Optimizations<br />
|TensorFlow Addons<br />
|TensorFlow Datasets<br />
|TensorFlow Hub<br />
|-<br />
|TensorFlow MetaData<br />
|PyTorch<br />
|TorchText<br />
|TorchVision<br />
|PyTorch Lightning<br />
|PyTorch Lightning Bolts<br />
|ONNX<br />
|Onnx-runtime<br />
|skl2onnx<br />
|tf2onnx<br />
|-<br />
|onnxmltools<br />
|onnxconverter-common<br />
|XGBoost<br />
|LightGBM<br />
|Transformers<br />
|Tokenizers<br />
|SentencePiece<br />
|Spacy<br />
|DALI<br />
|OpenCV<br />
|-<br />
|Horovod<br />
|PyArrow<br />
|grpc<br />
|uwsgi<br />
|ORC<br />
|Mamba<br />
|}<br />
<br />
== PyTorch ==<br />
=== Installing from IBM Open-CE Conda Channel ===<br />
The easiest way to install PyTorch on Mist is using IBM's Conda channel. User needs to prepare a conda environment and install PyTorch using IBM's Open-CE Conda channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n pytorch_env python=3.8<br />
source activate pytorch_env<br />
conda install -c /scinet/mist/ibm/open-ce pytorch=1.10.2 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 pytorch=1.7.1 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
Add below command into your job script before python command to get deterministic results, see details here: [https://github.com/pytorch/pytorch/issues/39849]<br />
<pre><br />
export CUBLAS_WORKSPACE_CONFIG=:4096:2<br />
</pre><br />
<br />
== RAPIDS ==<br />
The [https://rapids.ai RAPIDS] is a suite of open source software libraries that gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. The RAPIDS data science framework includes a collection of libraries: '''cuDF(GPU DataFrames)''', '''cuML(GPU Machine Learning Algorithms)''', '''cuStrings(GPU String Manipulation)''', etc.<br />
<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install RAPIDS on Mist is using IBM's Conda channel. User needs to prepare a conda environment with Python 3.6 or 3.7 and install powerai-rapids using IBM's Conda channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n rapids_env python=3.7<br />
source activate rapids_env<br />
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/ powerai-rapids<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
== TensorFlow and Keras ==<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install TensorFlow and Keras on Mist is using IBM's Open-CE Conda channel. User needs to prepare a conda environment and install TensorFlow using IBM's Open-CE Conda channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n tf_env python=3.8<br />
source activate tf_env<br />
conda install -c /scinet/mist/ibm/open-ce tensorflow==2.7.1 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 tensorflow==2.4.3 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
= Testing and debugging =<br />
You really should test your code before you submit it to the cluster to know if your code is correct and what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run no more than a couple of minutes, taking at most about 1-2GB of memory, and use no more than one gpu and a few cores.<br />
<!-- * You can run the [[Parallel Debugging with DDT|DDT]] debugger on the login nodes after <code>module load ddt</code>. --><br />
* Short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debug command:<br />
mist-login01:~$ debugjob --clean -g G<br />
where G is the number of gpus, If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a single node with 4 gpus for 30 minutes, and with G=8 (the maximum) gets you 2 nodes each with 4 gpus for 15 minutes. The <tt>--clean</tt> argument is optional but recommended as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script. Users needs to load module and activate the conda environment after a debug job starts. It is recommended to do a 'conda clean' before 'source activate ENV' in a debug job if --clean flag is missed.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Mist login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on some of Mist's 53 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Mist uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (4 GPUs per node)'''.<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
mist-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by single gpu or by full node, so you ask only 1 gpu or 4 gpus per node.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below). <br />
== SOSCIP Users ==<br />
*[https://www.soscip.org SOSCIP] is a consortium to bring together industrial partners and academic researchers and provide them with sophisticated advanced computing technologies and expertise to solve social, technical and business challenges across sectors and drive economic growth.<br />
<br />
If you are working on a SOSCIP project, please contact [mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca] to have your user account added to SOSCIP project accounts. SOSCIP users need to submit jobs with additional SLURM flag to get higher priority:<br />
<pre><br />
#SBATCH -A soscip-<SOSCIP_PROJECT_ID> #e.g. soscip-3-001<br />
OR<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID><br />
</pre><br />
<br />
== Single-GPU job script ==<br />
For a single GPU job, each will have a quarter of the node which is 1 GPU + 8/32 CPU Cores/Threads + ~58GB CPU memory. '''Users should never ask CPU or Memory explicitly.''' If running MPI program, user can set --ntasks to be the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
*It is suggested to use NVIDIA Multi-Process Service (MPS) if running multiple MPI ranks on one GPU.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate conda_env<br />
python code.py ...<br />
</pre><br />
<br />
== Full-node job script ==<br />
'''If you are not sure the program can be executed on multiple GPUs, please follow the single-gpu job instruction above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU job should ask for a minimum of one full node (4 GPUs). User need to specify "compute_full_node" partition in order to get all resource on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4 #this only affects MPI job<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued.<br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Running jobs<br />
!Submitted jobs (incl. running)<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 node (40&nbsp;cores) || default:&nbsp;20&nbsp;nodes&nbsp;(800&nbsp;cores) <br> with&nbsp;allocation:&nbsp;1000&nbsp;nodes&nbsp;(40000&nbsp;cores)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 node (40 cores) || 4 nodes (160 cores)|| N/A || 1 hour<br />
|-<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
= Jupyter Notebooks =<br />
SciNet’s [[Jupyter Hub]] is a Niagara-type node; it has a different CPU architecture and no GPUs. Conda environments prepared on Mist will not work there properly. Users who need to use Jupyter Notebook to develop and test some aspects of their workflow can create their own server on the Mist login node and use an SSH tunnel to connect to it from outside. Users who choose to do so have to keep in mind that the login node is a shared resource, and heavy calculations should be done only on compute nodes. Processes (including iPython kernels used by the notebooks) are limited to one hour of total CPU time: idle time will not be counted toward this one hour, and use of multiple cores will count proportionally to the number of cores (i.e. a kernel using all 128 virtual cores on the node will be killed after 28 seconds). Idle notebooks can still burden the node by hogging system and GPU memory, please be mindful of other users and terminate notebooks when work is done.<br />
<br />
As an example, let us create a new Conda environment and activate it:<br />
<pre><br />
module load anaconda3<br />
conda create -n jupyter_env python=3.7<br />
source activate jupyter_env<br />
</pre><br />
Install the Jupyter Notebook server:<br />
<pre><br />
conda install notebook<br />
</pre><br />
<br />
== Running the notebook server ==<br />
When the Conda environment is active, enter:<br />
<pre><br />
jupyter-notebook<br />
</pre><br />
By default, the Jupyter Notebook server uses port 8888 (can be overridden with the <code>--port</code> option). If another user has already started their own server, the default port may be busy, in which case the server will be listening on a different port. Once launched, the server will output some information to the terminal that will include the actual port number used and a 48-character token. For example:<br />
<pre>http://localhost:8890/?token=54c4090d……</pre><br />
In this example, the server is listening on port 8890.<br />
<br />
== Creating a tunnel ==<br />
In order to access this port remotely (i.e. from your office or home), an [https://en.wikipedia.org/wiki/Tunneling_protocol#Secure_Shell_tunneling SSH tunnel] has to be established. Please refer to your SSH client’s documentation for instructions on how to do that. For the OpenSSH client (standard in most Linux distributions and macOS), a tunnel can be opened in a separate terminal session to the one where the Jupyter Notebook server is running. In the new terminal, issue this command:<br />
<pre><br />
ssh -L8888:localhost:8890 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
(replace <code><username></code> with your actual username). The tunnel remains open as long as this SSH connection is alive. In this example, we tunnel the Mist login node’s port 8890 (where our server is assumed to be running) to our home computer’s port 8888 (any other free port is fine). The notebook can then be accessed in the browser at <code><nowiki>http://localhost:8888</nowiki></code> (followed by <code>/?token=54c4090d……</code>, or the token can be entered on the webpage).<br />
<br />
== Using Jupyter on compute nodes ==<br />
<br />
You can use the instructions here to set up a Jupyter Notebook server on a compute node (including in a [[#Testing_and_debugging|debugjob]]). '''We strongly discourage''' you from running an interactive notebook on a compute node (other than in a debugjob): scheduled jobs run at arbitrary times and are not meant to be interactive. Jupyter notebooks can be run non-interactively or converted to Python scripts.<br />
<br />
To launch the Jupyter Notebook server, load the <code>anaconda3</code> module and activate your environment as before (by adding the appropriate lines to the submission script, if you are not using the compute node with an interactive shell). Launching the server has to be done like so:<br />
<pre><br />
HOME=/dev/shm/$USER jupyter-notebook<br />
</pre><br />
That is because Jupyter will fail unless it can write to the home folder, which is read-only from compute nodes. This modification of the <code>$HOME</code> environment variable will carry over into the notebooks, which is usually not a problem, but in case the notebook relies on this environment variable (e.g. to read certain files), it can be reset manually in the notebook (<code>import os; os.environ['HOME']=……</code>).<br />
<br />
Because compute nodes are not accessible from the Internet, tunneling has to be done twice: once from the remote location (office or home) to the Mist login node, and then from the login node to the compute node. Assuming the server is running on port 8890 of the mist006 node, open the first tunnel in a new terminal session on the remote computer:<br />
<pre><br />
ssh -L8888:localhost:9999 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
where 9999 is any available port on the Mist login node (to test port availability enter <code>ss -Hln src :9999</code> in the terminal when connected to the Mist login node; an empty output indicates that the port is free). In the same session in the login node that was created with the above command, open the second tunnel to the compute node:<br />
<pre><br />
ssh -L9999:localhost:8890 mist006<br />
</pre><br />
Be aware that the second tunnel will automatically disconnect once the job on the compute node times out or is relinquished. The Jupyter Notebook server running on the compute node can now be accessed from the browser as in the previous subsection.<br />
<br />
<br />
= Support =<br />
<br />
SciNet inquiries:<br />
* [mailto:support@scinet.utoronto.ca support@scinet.utoronto.ca]<br />
<br />
SOSCIP inquiries:<br />
*[mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Mist&diff=3629
Mist
2022-03-09T14:42:18Z
<p>Ejspence: /* Full-node job script */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Mist.jpg|center|300px|thumb]]<br />
|name=Mist<br />
|installed=Dec 2019<br />
|operatingsystem= Red Hat Enterprise Linux 8.2<br />
|loginnode= mist.scinet.utoronto.ca<br />
|nnodes= 54 IBM AC922<br />
|rampernode= 256 GB <br />
|gpuspernode=4 V100-SMX2-32GB<br />
|interconnect=Mellanox EDR<br />
|vendorcompilers= NVCC, IBM XL<br />
|queuetype=Slurm<br />
}}<br />
<br />
=Specifications=<br />
Mist is a SciNet-[[#SOSCIP Users |SOSCIP]] joint GPU cluster consisting of 54 IBM AC922 servers. Each node of the cluster has 32 IBM Power9 cores, 256GB RAM and 4 NVIDIA V100-SMX2-32GB GPUs connected by NVLink. The cluster has an InfiniBand EDR interconnect providing GPUDirect RDMA capability.<br />
<br />
'''<span style="background:#fc8383">Important note:</span>''' as of 2021, the majority of computer systems (laptops, desktops, and HPC) use microprocessors from Intel and AMD implementing the 64-bit x86 instruction set architecture (ISA). This ISA is incompatible with Mist, whose hardware uses the 64-bit PPC ISA (set to little-endian mode). In practical terms, x86-compiled binaries (executables and libraries) cannot be installed on Mist. For this reason, the Niagara and Compute Canada software stacks (modules) cannot be made available on Mist, and using closed-source software is only possible when the vendor provides a compatible version of their application. '''Python applications''' almost always rely on bindings to libraries originally written in C or C++, and some of these are not available on PyPI or on various Conda channels as precompiled binaries compatible with Mist. The recommended way to use Python on Mist is to create a [[#Anaconda (Python)|Conda]] environment and install packages from the anaconda (default) channel, where most popular packages have a linux-ppc64le (Mist-compatible) version available. Some popular machine learning packages should be installed from the internal [[#Open-CE|Open-CE]] channel. Where a compatible Conda package cannot be found, installing from PyPI (<code>pip install</code>) can be attempted. Pip will attempt to compile the package’s source code if no compatible precompiled wheel is available, so a compiler module (such as <code>gcc/.core</code>) should be loaded in advance. Some packages require tweaking of the source code or build procedure to compile successfully on Mist; please contact [[#Support|support]] if you need assistance.<br />
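As a minimal sketch of the pip fallback described above (the package name is a placeholder, and a Conda environment is assumed to have already been created and activated as described in the [[#Anaconda (Python)|Anaconda]] section below):<br />
<pre><br />
module load gcc/.core      # compiler used by pip if it has to build the package from source<br />
pip install <package_name><br />
</pre><br />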
<br />
= Getting started on Mist =<br />
As of January 22, 2022, authentication is only allowed via SSH keys. [https://docs.computecanada.ca/wiki/SSH_Keys Please refer to this page] to generate your SSH key pair and make sure you use them securely.<br />
<br />
Mist can be accessed directly:<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@mist.scinet.utoronto.ca<br />
</pre><br />
The Mist login node '''mist-login01''' can also be accessed via the Niagara cluster.<br />
<pre><br />
ssh -i /path/to/ssh_private_key -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y mist-login01<br />
</pre><br />
== Storage ==<br />
The filesystem for Mist is shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Mist: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]] and a list of [[Modules for Mist]] is also available.<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
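For example (a sketch only; the exact variable name set by a given module can be checked with <code>module show <module-name></code>):<br />
<pre><br />
module load cuda<br />
module show cuda                 # lists the environment variables this module sets<br />
ls $SCINET_CUDA_ROOT/include     # assuming the variable is named SCINET_CUDA_ROOT<br />
</pre><br />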
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
== Tips for loading software ==<br />
<br />
* We advise '''''against''''' loading modules in your .bashrc. This can lead to very confusing behaviour under certain circumstances. Our guidelines for .bashrc files can be found [[bashrc guidelines|here]].<br />
* Instead, load modules by hand when needed, or by sourcing a separate script.<br />
* Load run-specific modules inside your job submission script.<br />
* Short names give default versions; e.g. <code>cuda</code> → <code>cuda/11.0.3</code>. It is usually better to be explicit about the versions, for future reproducibility.<br />
* Modules often require other modules to be loaded first. Solve these dependencies by using [[Using_modules#Module_spider | <code>module spider</code>]].<br />
<br />
= Available compilers and interpreters =<br />
* The <tt>cuda</tt> module has to be loaded first for GPU software.<br />
* For most compiled software, one should use the GNU compilers (<tt>gcc</tt> for C, <tt>g++</tt> for C++, and <tt>gfortran</tt> for Fortran). Loading <tt>gcc</tt> module makes these available. <br />
* The IBM XL compiler suite (<tt>xlc_r, xlc++_r, xlf_r</tt>) is also available, if you load one of the <tt>xl</tt> modules.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> or <tt>spectrum-mpi</tt> module (a minimal example is sketched below).<br />
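The following sketch compiles a simple MPI program with the GNU compilers; the source file names are placeholders, and <code>module spider openmpi</code> will tell you the exact module versions and prerequisites to load:<br />
<pre><br />
module load gcc openmpi<br />
mpicc  -O2 -o hello_mpi  hello_mpi.c     # C source<br />
mpif90 -O2 -o hello_mpif hello_mpi.f90   # Fortran source<br />
</pre><br />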
<br />
=== CUDA ===<br />
<br />
The currently installed CUDA Toolkits are '''11.0.3''' and '''10.2.2 (10.2.89)'''.<br />
<pre><br />
module load cuda/11.0.3<br />
module load cuda/10.2.2<br />
</pre><br />
*A compiler (GCC, XL or NVHPC/PGI) module must be loaded in order to use CUDA to build any code.<br />
The current NVIDIA driver version is 450.119.04.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/9.3.0 (must load CUDA 11)<br />
gcc/8.5.0 (must load CUDA 10 or 11)<br />
gcc/10.3.0 (w/o CUDA)<br />
</pre><br />
<br />
=== IBM XL Compilers ===<br />
<br />
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run<br />
<br />
<pre><br />
module load xl/16.1.1.10<br />
</pre><br />
<br />
IBM XL Compilers are enabled for use with NVIDIA GPUs, including support for OpenMP GPU offloading and integration with NVIDIA's nvcc command to compile host-side code for the POWER9 CPU. Information about the IBM XL Compilers can be found at the following links:[https://www.ibm.com/support/knowledgecenter/SSXVZZ_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL C/C++], <br />
[https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]<br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers, including GCC and XL. The <tt>spectrum-mpi/<version></tt> module provides IBM Spectrum MPI.<br />
<br />
=== NVHPC/PGI ===<br />
The PGI compiler is provided as part of NVHPC (the NVIDIA HPC SDK).<br />
<pre><br />
module load nvhpc/21.3<br />
</pre><br />
<br />
= Software =<br />
== Amber20 ==<br />
<br />
Users who hold an Amber20 license can build Amber20 from its source code and run it on Mist. '''SOSCIP/SciNet does not provide the Amber license or source code.'''<br />
<br />
=== Building Amber20 ===<br />
Modules that are needed for building Amber20:<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05 cmake/3.19.8<br />
</pre><br />
Cmake configuration:<br />
<pre><br />
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/where-amber-install -DCOMPILER=GNU -DMPI=FALSE -DCUDA=TRUE -DINSTALL_TESTS=TRUE -DDOWNLOAD_MINICONDA=FALSE -DOPENMP=TRUE -DNCCL=FALSE -DAPPLY_UPDATES=TRUE<br />
</pre><br />
<br />
=== Running Amber20 ===<br />
'''On NVIDIA Pascal P100 and later GPUs (such as the V100), Amber20 does not scale beyond a single GPU.''' It is highly recommended to run Amber20 as a single-GPU job.<br />
An example job script:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP-project-ID><br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 anaconda3/2021.05<br />
export PATH=$HOME/where-amber-install/bin:$PATH<br />
export LD_LIBRARY_PATH=$HOME/where-amber-install/lib:$LD_LIBRARY_PATH<br />
pmemd.cuda .... <parameters> ...<br />
</pre><br />
<br />
== Anaconda (Python) ==<br />
Anaconda is a popular distribution of the Python programming language. It contains several common Python libraries such as SciPy and NumPy as pre-built packages, which eases installation. Anaconda is provided on Mist via the '''anaconda3''' modules.<br />
<br />
To set up a local conda environment, users need to load the module and create the environment:<br />
<pre><br />
module load anaconda3<br />
conda create -n myPythonEnv python=3.8<br />
</pre><br />
*Note: By default, conda environments are located in '''$HOME/.conda/envs'''. The cache (downloaded tarballs and packages) is under '''$HOME/.conda/pkgs'''. Users may run into disk quota problems if too many environments are created. To clean the conda cache, '''please run "conda clean -y --all" and "rm -rf $HOME/.conda/pkgs/*" after installing packages'''.<br />
<br />
To activate the conda environment: (should be activated before running python)<br />
<pre><br />
source activate myPythonEnv<br />
</pre><br />
Note that you SHOULD NOT use '''conda activate myPythonEnv''' to activate the environment; this leads to all sorts of problems. Once the environment is activated, users can update or install packages via '''conda''' or '''pip''':<br />
<pre><br />
conda install <package_name> (preferred way to install packages)<br />
pip install <package_name><br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
To deactivate:<br />
<pre><br />
source deactivate<br />
</pre><br />
To remove a conda environment:<br />
<pre><br />
conda remove --name myPythonEnv --all<br />
</pre><br />
To verify that the environment was removed, run:<br />
<pre><br />
conda info --envs<br />
</pre><br />
<br />
=== Submitting Python Job ===<br />
A single-gpu job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate myPythonEnv<br />
python code.py ...<br />
</pre><br />
<br />
== CuPy ==<br />
[https://cupy.chainer.org CuPy] is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries including cuBLAS, cuDNN, cuRand, cuSolver, cuSPARSE, cuFFT and NCCL to make full use of the GPU architecture. CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA, consisting of the core multi-dimensional array class, cupy.ndarray, and many functions operating on it. It supports a subset of the numpy.ndarray interface.<br />
<br />
CuPy can be installed into any conda environment. The Python packages numpy, six and fastrlock are required; cuDNN and NCCL are optional.<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 nccl/2.9.9 anaconda3/2021.05<br />
conda create -n cupy-env python=3.8 numpy six fastrlock<br />
source activate cupy-env<br />
CFLAGS="-I$MODULE_CUDNN_PREFIX/include -I$MODULE_NCCL_PREFIX/include -I$MODULE_CUDA_PREFIX/include" LDFLAGS="-L$MODULE_CUDNN_PREFIX/lib64 -L$MODULE_NCCL_PREFIX/lib" CUDA_PATH=$MODULE_CUDA_PREFIX pip install cupy<br />
#building/installing CuPy will take a few minutes<br />
</pre><br />
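To verify the installation, a quick check such as the following can be run on a node with a GPU (e.g. inside a debugjob); this snippet is illustrative and not part of the original instructions:<br />
<pre><br />
source activate cupy-env<br />
python -c "import cupy as cp; x = cp.arange(10); print((x**2).sum())"   # should print 285<br />
</pre><br />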
<br />
== Gromacs ==<br />
[http://www.gromacs.org/ GROMACS] is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.<br />
*'''GROMACS 2019'''<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
</pre><br />
*'''GROMACS 2020 and 2021''': the thread-MPI version supports full GPU enablement of all key computational sections. The GPU is used throughout the timestep and repeated CPU-GPU transfers are eliminated. Users are advised to verify the results carefully.<br />
<pre><br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.4<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2020.6<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.2<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 openmpi/4.1.1+ucx-1.10.0 gromacs/2021.2 (testing purpose only)<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.4<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
</pre><br />
=== Small/Medium Simulation ===<br />
Due to the lack of PME domain decomposition support on the GPU, Gromacs uses the CPU to calculate PME when using multiple GPUs. '''It is always recommended to use a single GPU for small and medium sized simulations with Gromacs.''' By using only 1 tMPI thread (with multiple OpenMP threads) on a single GPU, both non-bonded PP and PME are automatically offloaded to the GPU when possible.<br />
* Gromacs 2019 example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/10.2.2 gcc/8.5.0 gromacs/2019.6<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
* Gromacs 2020 or 2021 example: <br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
export OMP_NUM_THREADS=8<br />
export OMP_PLACES=cores<br />
export GMX_FORCE_UPDATE_DEFAULT_GPU=true<br />
gmx mdrun -pin off -ntmpi 1 -ntomp 8 ... <other parameters><br />
</pre><br />
<br />
=== Large Simulation ===<br />
If the memory available to a single-GPU job (~58 GB) is not sufficient for the simulation, multiple GPUs can be used. It is suggested to start by testing one full node with 4 GPUs and forcing PME onto a GPU. Multiple PME ranks are not supported with PME on GPU, so if a GPU is used for the PME calculation, -npme (the number of PME ranks) must be set to 1. If PME has less work than PP, it is suggested to run multiple ranks per GPU, so that the GPU holding the PME rank can also do some work on PP rank(s).<br />
'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not waste 3 additional GPUs for only a small performance improvement.'''<br />
*An example using 4 GPUs, 7 PP ranks/tMPI threads + 1 PME rank/tMPI thread ('''-pin on -pme gpu -npme 1''' must be added to the mdrun command in order to force the GPU to do PME):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on -pme gpu -npme 1 ... <add your parameters><br />
</pre><br />
*It is suggested to also test using '''-ntmpi 4''' and '''export OMP_NUM_THREADS=8''' if you receive a NOTE in Gromacs output saying "% performance was lost because the PME ranks had more work to do than the PP ranks". In this case, NVIDIA MPS is not needed since there is only one MPI rank per GPU.<br />
*'''Please note that PME solving on the GPU is still an initial implementation, and comes with the set of limitations outlined below.'''<br />
<pre><br />
* Only a PME order of 4 is supported on GPUs.<br />
* PME will run on a GPU only when exactly one rank has a PME task, ie. decompositions with multiple ranks doing PME are not supported.<br />
* Only single precision is supported.<br />
* Free energy calculations where charges are perturbed are not supported, because only single PME grids can be calculated.<br />
* Only dynamical integrators are supported (ie. leap-frog, Velocity Verlet, stochastic dynamics)<br />
* LJ PME is not supported on GPUs.<br />
</pre><br />
*An example using 4 GPUs, with '''PME on the CPU''' ('''-pin on''' must be added to the mdrun command for proper CPU thread binding):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 gromacs/2021.5<br />
<br />
export OMP_NUM_THREADS=4<br />
gmx mdrun -ntmpi 8 -pin on ... <add your parameters><br />
<br />
# "-ntmpi 16, OMP_NUM_THREADS=2" and "-ntmpi 4, OMP_NUM_THREADS=8" should also be tested. <br />
# num_thread_MPI_ranks(-ntmpi) * num_OpenMP_threads = 32<br />
</pre><br />
*'''If your simulation fits in a single-GPU job, please use a single GPU to get much higher efficiency. Do not waste 3 additional GPUs for only a small performance improvement.'''<br />
*'''NOTE: The above examples will NOT work across multiple nodes. If your simulation is too large for a single GPU node, please contact SciNet/SOSCIP support.'''<br />
<br />
== NAMD ==<br />
[http://www.ks.uiuc.edu/Research/namd/ NAMD] is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.<br />
=== 2.14 ===<br />
<pre><br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
</pre><br />
==== Running with single GPU ====<br />
If you have many jobs to run, it is always suggested to use a single GPU per job. This makes jobs easier to schedule and gives better overall performance.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --nodes=1<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -bind-to none -hostfile nodelist-$SLURM_JOB_ID `which namd2` +idlepoll +ppn 8 +p 8 stmv.namd<br />
</pre><br />
<br />
==== Running with one process per node (4 GPUs)====<br />
An example of the job script (using 1 node, '''one process per node''', 32 CPU threads per process + 4 GPUs per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=1<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 1 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 32 +p $((32*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
==== Running with one process per GPU (4 GPUs)====<br />
NAMD may scale better when using '''one process per GPU'''. Please run your own benchmarks.<br />
An example of the job script (using 1 node, '''one process per GPU''', 8 CPU threads per process):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=20:00<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4<br />
#SBATCH --nodes=1<br />
#SBATCH -p compute_full_node<br />
<br />
module load MistEnv/2021a cuda/11.0.3 gcc/9.4.0 spectrum-mpi/10.4.0 namd/2.14<br />
scontrol show hostnames > nodelist-$SLURM_JOB_ID<br />
<br />
`which charmrun` -npernode 4 -hostfile nodelist-$SLURM_JOB_ID `which namd2` +setcpuaffinity +pemap 0-127:4 +idlepoll +ppn 8 +p $((8*SLURM_NTASKS)) stmv.namd<br />
</pre><br />
<br />
== Open-CE ==<br />
[https://github.com/open-ce/open-ce Open-CE] is an '''IBM''' repo for feedstock collection, environment data, and scripts for building Tensorflow, Pytorch, and other machine learning packages and dependencies. Open-CE is distributed as a '''conda channel''' on the Mist cluster.<br />
'''Available packages and versions are listed here [https://github.com/open-ce/open-ce/releases/tag/open-ce-v1.5.2 Open-CE Releases]'''. Currently only python 3.8 and CUDA 11.2 are supported. If you need a different python or cuda version, please contact SOSCIP or SciNet support.<br />
<br />
*Packages can be installed by specifying the Open-CE conda channel:<br />
<pre><br />
conda install -c /scinet/mist/ibm/open-ce python=3.8 cudatoolkit=11.2 PACKAGE<br />
</pre><br />
*Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
{| class="wikitable"<br />
|+Available Packages:<br />
|-<br />
|Tensorflow<br />
|TensorFlow Estimators<br />
|TensorFlow Probability<br />
|TensorBoard<br />
|TensorBoard Data Server<br />
|TensorFlow Text<br />
|TensorFlow Model Optimizations<br />
|TensorFlow Addons<br />
|TensorFlow Datasets<br />
|TensorFlow Hub<br />
|-<br />
|TensorFlow MetaData<br />
|PyTorch<br />
|TorchText<br />
|TorchVision<br />
|PyTorch Lightning<br />
|PyTorch Lightning Bolts<br />
|ONNX<br />
|Onnx-runtime<br />
|skl2onnx<br />
|tf2onnx<br />
|-<br />
|onnxmltools<br />
|onnxconverter-common<br />
|XGBoost<br />
|LightGBM<br />
|Transformers<br />
|Tokenizers<br />
|SentencePiece<br />
|Spacy<br />
|DALI<br />
|OpenCV<br />
|-<br />
|Horovod<br />
|PyArrow<br />
|grpc<br />
|uwsgi<br />
|ORC<br />
|Mamba<br />
|}<br />
<br />
== PyTorch ==<br />
=== Installing from IBM Open-CE Conda Channel ===<br />
The easiest way to install PyTorch on Mist is to use IBM's Open-CE Conda channel. Users need to prepare a conda environment and install PyTorch from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n pytorch_env python=3.8<br />
source activate pytorch_env<br />
conda install -c /scinet/mist/ibm/open-ce pytorch=1.10.2 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 pytorch=1.7.1 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
Add the command below to your job script, before the python command, to get deterministic results; see details here: [https://github.com/pytorch/pytorch/issues/39849]<br />
<pre><br />
export CUBLAS_WORKSPACE_CONFIG=:4096:2<br />
</pre><br />
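A quick, illustrative way to confirm that the installation can see the GPUs (run on a node with a GPU, e.g. inside a debugjob):<br />
<pre><br />
source activate pytorch_env<br />
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"<br />
</pre><br />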
<br />
== RAPIDS ==<br />
[https://rapids.ai RAPIDS] is a suite of open-source software libraries that lets you execute end-to-end data science and analytics pipelines entirely on GPUs. The RAPIDS data science framework includes a collection of libraries: '''cuDF (GPU DataFrames)''', '''cuML (GPU Machine Learning Algorithms)''', '''cuStrings (GPU String Manipulation)''', etc.<br />
<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install RAPIDS on Mist is to use IBM's Conda channel. Users need to prepare a conda environment with Python 3.6 or 3.7 and install powerai-rapids from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n rapids_env python=3.7<br />
source activate rapids_env<br />
conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/ powerai-rapids<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
<br />
== TensorFlow and Keras ==<br />
=== Installing from IBM Conda Channel ===<br />
The easiest way to install TensorFlow and Keras on Mist is to use IBM's Open-CE Conda channel. Users need to prepare a conda environment and install TensorFlow from that channel.<br />
<pre><br />
module load anaconda3<br />
conda create -n tf_env python=3.8<br />
source activate tf_env<br />
conda install -c /scinet/mist/ibm/open-ce tensorflow==2.7.1 cudatoolkit=11.2<br />
or<br />
conda install -c /scinet/mist/ibm/open-ce-1.2 tensorflow==2.4.3 cudatoolkit=11.0 (or 10.2)<br />
</pre><br />
Once the installation finishes, please clean the cache:<br />
<pre><br />
conda clean -y --all<br />
rm -rf $HOME/.conda/pkgs/*<br />
</pre><br />
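As with PyTorch, an illustrative check that TensorFlow detects the GPUs (run on a node with a GPU):<br />
<pre><br />
source activate tf_env<br />
python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"<br />
</pre><br />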
<br />
= Testing and debugging =<br />
You should test your code before submitting it to the cluster, both to confirm that it is correct and to find out what resources it needs.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, take at most about 1-2 GB of memory, and use no more than one GPU and a few cores.<br />
<!-- * You can run the [[Parallel Debugging with DDT|DDT]] debugger on the login nodes after <code>module load ddt</code>. --><br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob command:<br />
mist-login01:~$ debugjob --clean -g G<br />
where G is the number of GPUs. G=1 gives an interactive session for 2 hours, G=4 gets you a single node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you 2 nodes, each with 4 GPUs, for 15 minutes. The <tt>--clean</tt> argument is optional but recommended, as it starts the session without any modules loaded, thus mimicking more closely what happens when you submit a job script. Users need to load modules and activate their conda environment after the debug job starts. If the --clean flag was omitted, it is recommended to run 'conda clean' before 'source activate ENV' in the debug job.<br />
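A typical debug session might look like the following (module and environment names are illustrative):<br />
<pre><br />
mist-login01:~$ debugjob --clean -g 1<br />
mistXXX:~$ module load anaconda3<br />
mistXXX:~$ source activate myPythonEnv<br />
mistXXX:~$ python code.py<br />
</pre><br />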
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Mist login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on some of Mist's 53 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Mist uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (4 GPUs per node)'''.<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
mist-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by single GPU or by full node, so you may request only 1 GPU or a full node with 4 GPUs.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below). <br />
== SOSCIP Users ==<br />
*[https://www.soscip.org SOSCIP] is a consortium to bring together industrial partners and academic researchers and provide them with sophisticated advanced computing technologies and expertise to solve social, technical and business challenges across sectors and drive economic growth.<br />
<br />
If you are working on a SOSCIP project, please contact [mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca] to have your user account added to the SOSCIP project accounts. SOSCIP users need to submit jobs with an additional SLURM flag to get higher priority:<br />
<pre><br />
#SBATCH -A soscip-<SOSCIP_PROJECT_ID> #e.g. soscip-3-001<br />
OR<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID><br />
</pre><br />
<br />
== Single-GPU job script ==<br />
A single-GPU job gets a quarter of a node, i.e. 1 GPU, 8 CPU cores (32 hardware threads), and ~58 GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
*It is suggested to use the NVIDIA Multi-Process Service (MPS) if running multiple MPI ranks on one GPU; a minimal sketch is given after the example below.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load anaconda3<br />
source activate conda_env<br />
python code.py ...<br />
</pre><br />
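If you do run multiple MPI ranks on the single GPU, the following is a minimal sketch of enabling MPS inside the job script (nvidia-cuda-mps-control is the standard NVIDIA MPS tool; directory paths, modules and the mpirun line are illustrative):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --ntasks=4<br />
#SBATCH --time=1:00:0<br />
<br />
module load <modules you need><br />
<br />
# start the MPS control daemon for this job<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/$USER/mps<br />
export CUDA_MPS_LOG_DIRECTORY=/tmp/$USER/mps-log<br />
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY<br />
nvidia-cuda-mps-control -d<br />
<br />
mpirun -np 4 ./my_mpi_gpu_program<br />
<br />
# shut the MPS daemon down at the end of the job<br />
echo quit | nvidia-cuda-mps-control<br />
</pre><br />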
<br />
== Full-node job script ==<br />
'''If you are not sure whether the program can use multiple GPUs, please follow the single-GPU job instructions above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU jobs should ask for a minimum of one full node (4 GPUs). Users need to specify the "compute_full_node" partition in order to get all resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=4<br />
#SBATCH --ntasks=4 #this only affects MPI job<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
#SBATCH --account=soscip-<SOSCIP_PROJECT_ID> #For SOSCIP projects only<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued.<br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Running jobs<br />
!Submitted jobs (incl. running)<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 node (40&nbsp;cores) || default:&nbsp;20&nbsp;nodes&nbsp;(800&nbsp;cores) <br> with&nbsp;allocation:&nbsp;1000&nbsp;nodes&nbsp;(40000&nbsp;cores)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 node (40 cores) || 4 nodes (160 cores)|| N/A || 1 hour<br />
|-<br />
|Archiving or retrieving data in [[HPSS]]|| archivelong || 2 per user (max 5 total) || 10 per user || N/A || N/A|| 15 minutes || 72 hours<br />
|-<br />
|Inspecting archived data, small archival actions in [[HPSS]] || archiveshort || 2 per user|| 10 per user || N/A || N/A || 15 minutes || 1 hour<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
= Jupyter Notebooks =<br />
SciNet’s [[Jupyter Hub]] is a Niagara-type node; it has a different CPU architecture and no GPUs, so Conda environments prepared on Mist will not work there properly. Users who need a Jupyter Notebook to develop and test some aspects of their workflow can instead run their own server on the Mist login node and use an SSH tunnel to connect to it from outside. Keep in mind that the login node is a shared resource, and heavy calculations should be done only on compute nodes. Processes (including the IPython kernels used by the notebooks) are limited to one hour of total CPU time: idle time does not count toward this hour, and use of multiple cores counts proportionally to the number of cores (e.g. a kernel using all 128 virtual cores on the node will be killed after about 28 seconds). Idle notebooks can still burden the node by hogging system and GPU memory, so please be mindful of other users and terminate notebooks when your work is done.<br />
<br />
As an example, let us create a new Conda environment and activate it:<br />
<pre><br />
module load anaconda3<br />
conda create -n jupyter_env python=3.7<br />
source activate jupyter_env<br />
</pre><br />
Install the Jupyter Notebook server:<br />
<pre><br />
conda install notebook<br />
</pre><br />
<br />
== Running the notebook server ==<br />
When the Conda environment is active, enter:<br />
<pre><br />
jupyter-notebook<br />
</pre><br />
By default, the Jupyter Notebook server uses port 8888 (can be overridden with the <code>--port</code> option). If another user has already started their own server, the default port may be busy, in which case the server will be listening on a different port. Once launched, the server will output some information to the terminal that will include the actual port number used and a 48-character token. For example:<br />
<pre>http://localhost:8890/?token=54c4090d……</pre><br />
In this example, the server is listening on port 8890.<br />
<br />
== Creating a tunnel ==<br />
In order to access this port remotely (i.e. from your office or home), an [https://en.wikipedia.org/wiki/Tunneling_protocol#Secure_Shell_tunneling SSH tunnel] has to be established. Please refer to your SSH client’s documentation for instructions on how to do that. For the OpenSSH client (standard in most Linux distributions and macOS), a tunnel can be opened in a terminal session separate from the one where the Jupyter Notebook server is running. In the new terminal, issue this command:<br />
<pre><br />
ssh -L8888:localhost:8890 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
(replace <code><username></code> with your actual username). The tunnel remains open as long as this SSH connection is alive. In this example, we tunnel the Mist login node’s port 8890 (where our server is assumed to be running) to our home computer’s port 8888 (any other free port is fine). The notebook can then be accessed in the browser at <code><nowiki>http://localhost:8888</nowiki></code> (followed by <code>/?token=54c4090d……</code>, or the token can be entered on the webpage).<br />
<br />
== Using Jupyter on compute nodes ==<br />
<br />
You can use the instructions here to set up a Jupyter Notebook server on a compute node (including in a [[#Testing_and_debugging|debugjob]]). '''We strongly discourage''' you from running an interactive notebook on a compute node (other than in a debugjob): scheduled jobs run at arbitrary times and are not meant to be interactive. Jupyter notebooks can be run non-interactively or converted to Python scripts.<br />
<br />
To launch the Jupyter Notebook server, load the <code>anaconda3</code> module and activate your environment as before (by adding the appropriate lines to the submission script, if you are not using the compute node with an interactive shell). Launching the server has to be done like so:<br />
<pre><br />
HOME=/dev/shm/$USER jupyter-notebook<br />
</pre><br />
That is because Jupyter will fail unless it can write to the home folder, which is read-only from compute nodes. This modification of the <code>$HOME</code> environment variable will carry over into the notebooks, which is usually not a problem, but in case the notebook relies on this environment variable (e.g. to read certain files), it can be reset manually in the notebook (<code>import os; os.environ['HOME']=……</code>).<br />
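If you are launching the server from a submission script rather than an interactive shell, a minimal sketch of such a script could look like the following (environment name, resources and walltime are illustrative); once the job is running, note the compute node and port from the job output and set up the tunnels as described below:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
<br />
module load anaconda3<br />
source activate jupyter_env<br />
HOME=/dev/shm/$USER jupyter-notebook --no-browser<br />
</pre><br />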
<br />
Because compute nodes are not accessible from the Internet, tunneling has to be done twice: once from the remote location (office or home) to the Mist login node, and then from the login node to the compute node. Assuming the server is running on port 8890 of the mist006 node, open the first tunnel in a new terminal session on the remote computer:<br />
<pre><br />
ssh -L8888:localhost:9999 <username>@mist.scinet.utoronto.ca<br />
</pre><br />
where 9999 is any available port on the Mist login node (to test port availability enter <code>ss -Hln src :9999</code> in the terminal when connected to the Mist login node; an empty output indicates that the port is free). In the same session in the login node that was created with the above command, open the second tunnel to the compute node:<br />
<pre><br />
ssh -L9999:localhost:8890 mist006<br />
</pre><br />
Be aware that the second tunnel will automatically disconnect once the job on the compute node times out or is relinquished. The Jupyter Notebook server running on the compute node can now be accessed from the browser as in the previous subsection.<br />
<br />
<br />
= Support =<br />
<br />
SciNet inquiries:<br />
* [mailto:support@scinet.utoronto.ca support@scinet.utoronto.ca]<br />
<br />
SOSCIP inquiries:<br />
*[mailto:soscip-support@scinet.utoronto.ca soscip-support@scinet.utoronto.ca]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3395
Main Page
2022-01-06T13:21:15Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
<b>Thu Jan 6 08:20 EST AM 2022</b> The SciNet filesystem is having issues. We are investigating.<br />
<br />
<!-- <b><span style="color:red">Emergency shutdown Thursday January 6, 2022</span></b>: An emergency shutdown of all SciNet to replace a crucial file system component is planned to take place on Thursday January 6, 2022, and will require at least 4 hours of downtime. Updates on the downtime will be sent out closer to the date. --><br />
<br />
<br />
<b>Fri Dec 24 13:31 EST PM 2021</b> Please note the following scheduled network maintenance, which will result in loss of connectivity to the SciNet datacentre. Start time: Dec 29, 00:30 EST. Estimated duration: 4 hours and 30 minutes. <br />
<br />
<b>Mon Dec 20 4:29 EST PM 2021</b> Filesystem is back to normal. <br />
<br />
<b>Mon Dec 20 2:53 EST PM 2021</b> Filesystem problem - We are investigating. <br />
<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH keys]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Python&diff=3358
Python
2021-12-09T14:13:52Z
<p>Ejspence: /* Error messages */</p>
<hr />
<div>[http://www.python.org/ Python] is a programming language that continues to grow in popularity for scientific computing. It is very fast to write code in, but the resulting software is much slower than C or Fortran; one should be wary of doing too much compute-intensive work in Python. <br />
__FORCETOC__ <br />
<br />
= Python on Niagara =<br />
<br />
We currently have two families of Python installed on [[Niagara_Quickstart|Niagara]]. <br />
<br />
* Regular Python<br />
* Intel Python (a variant of anaconda)<br />
<br />
Here we describe the differences between these packages.<br />
<br />
Note that it is highly recommended that you use the NiaEnv/2019b stack by loading the corresponding module, i.e.:<br />
<br />
module load NiaEnv/2019b<br />
<br />
If you do not, you are using the 2018a stack whose python setup is less optimal.<br />
<br />
== Regular Python ==<br />
<br />
Python versions 2.7 and 3.6, 3.7, and 3.8 have been installed from source and are optimized for Niagara. We call these 'regular' python versions because they are not dependent on other distribution mechanisms like (ana)conda. Such distributions do not play well with the rest of the software stack, so the 'regular' python modules should be your first choice.<br />
<br />
In the [[Niagara_Quickstart#Software_stacks:_NiaEnv_and_CCEnv | Niagara Software Stack]] version 2019b, i.e., NiaEnv/2019b, the specific versions are 2.7.15 and 3.8.5, so you can load python 2 or python 3 using<br />
<br />
module load python/2.7.15<br />
module load python/3.8.5<br />
<br />
Both these installations come with the following optimized python packages preinstalled:<br />
<br />
virtualenv<br />
numpy<br />
scipy<br />
scikit-learn<br />
ipp<br />
daal<br />
cython <br />
matplotlib<br />
ipython<br />
numba<br />
numexpr<br />
pandas<br />
line_profiler<br />
memory_profiler<br />
funcsigs<br />
pycosat<br />
pyeditline<br />
pyOpenSSL<br />
PySocks<br />
PyYAML<br />
requests <br />
xgboost<br />
<br />
In addition, the python/3.8.5 module also has dask installed.<br />
<br />
In the previous NiaEnv/2018a stack, the regular python versions did not have these packages, and users needed to install them in their own home directory. This was wasteful in terms of storage and occasionally led to quota issues, so we highly recommend using the NiaEnv/2019b packages, which have been the default since September 1, 2020.<br />
<br />
<br />
Additional packages in these modules should be installed in [[#Using_Virtualenv_in_Regular_Python | virtual environments]].<br />
<br />
== Intel Python ==<br />
<br />
The Intel Python modules are based on the Anaconda package, a python distribution that aims to simplify package management. Intel has modified the package, and optimized the libraries to use the MKL libraries, which should make them faster than the Anaconda modules for some calculations. These modifications have also been incorporated in the intel-<tt>PACKAGES</tt> included in the regular python modules discussed above, but with Intel Python, you also get the conda command. You can load the python 2 version or the python 3 version of intel python with<br />
<br />
module load intelpython2<br />
module load intelpython3<br />
<br />
Packages in this module can be installed in so-called conda environments (see below), although virtualenv also works. <br />
<br />
A word of caution: conda environments are very wasteful when it comes to the number of files that they store in your home directory, and there is a good chance you will hit your quota of 250,000 files with only a few conda environments. In addition, conda is a package manager in its own right, which means it does not always work well in combination with the rest of the software stack.<br />
<br />
== Miniconda and Anaconda ==<br />
<br />
If you are looking for anaconda or miniconda, you should find that intelpython is a good substitute. In the NiaEnv/2019b stack, we no longer provide anaconda modules, but we do have the aliases conda2 and conda3 for intelpython2 and intelpython3. <br />
<br />
We advise against installing your own anaconda or miniconda in your home directory. Instead, start from one of the intelpython modules and use conda environments, or, even better, start from a regular python module and create a virtualenv in which you can install your own packages. Installing your own anaconda or miniconda would cause many more files to be installed in your $HOME directory, and this might cause trouble with the quota on the number of files.<br />
<br />
= Installing your own Python Modules =<br />
<br />
If you need to install your own Python modules, either in regular python or with conda, you should set up a virtual or conda environment. Visit the [[Installing your own Python Modules]] page for instructions on how to set this up.<br />
<br />
We urge you to remove any conda or virtual environments that you are not using, to help reduce the number of files on the $HOME file system.<br />
<br />
{{:Installing your own Python Modules}}<br />
<br />
= Running serial Python jobs =<br />
<br />
As with all serial jobs, if your Python computation does not use multiple cores, you should bundle them up so the 40 cores of a node are all performing work. Examples of this can be found on [[Running_Serial_Jobs_on_Niagara|this]] page.<br />
<br />
= Using a Jupyter Notebook =<br />
<br />
== Jupyter Hub ==<br />
You may develop your Python scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See [[Jupyter_Hub | the Jupyter Hub page]] for details on how to access that node, and develop your code.<br />
<br />
The Jupyter Hub is a shared resource, much like the login nodes. You should not use it for extensive computations. For that you'll need to run Jupyter on a compute node.<br />
<br />
== Running Jupyter on a Niagara Compute Node == <br />
<br />
If you need more memory or more cores for your notebook calculation, you should request a node through the scheduler and run Jupyter on it yourself.<br />
<br />
1. To be able to run Jupyter on a compute node, you must first (a) install it inside a virtual environment, (b) make the 'runtime' directory that jupyter writes to under $HOME point to a writable location, and (c) create a little helper script called <tt>notebook.sh</tt> that will be used to start the jupyter server in step 2. These are the commands that you should use for the installation (which you should do only once, on a login node):<br />
<br />
(a) Create virtual env<br />
<pre><br />
$ module load NiaEnv/2019b python/3.8.5<br />
$ virtualenv --system-site-packages $HOME/.virtualenvs/jupyter<br />
$ source $HOME/.virtualenvs/jupyter/bin/activate<br />
$ pip install jupyter jupyterlab<br />
$ deactivate<br />
</pre><br />
You can choose a directory other than <tt>$HOME/.virtualenvs/jupyter</tt> in which to create the virtual environment, but you need to be consistent and use the same directory everywhere.<br />
<br />
(b) Make a writable 'runtime' directory for Jupyter. <br />
<pre><br />
$ mkdir -p $HOME/.local/share/jupyter/runtime <br />
$ mv -f $HOME/.local/share/jupyter/runtime $SCRATCH/jupyter_runtime || mkdir $SCRATCH/jupyter_runtime<br />
$ ln -sT $SCRATCH/jupyter_runtime $HOME/.local/share/jupyter/runtime<br />
</pre><br />
<br />
(c) Create a launch script.<br />
<pre><br />
$ cat > $HOME/.virtualenvs/jupyter/bin/notebook.sh <<EOF<br />
#!/bin/bash<br />
source \$HOME/.virtualenvs/jupyter/bin/activate<br />
export XDG_DATA_HOME=\$SCRATCH/.share<br />
export XDG_CACHE_HOME=\$SCRATCH/.cache<br />
export XDG_CONFIG_HOME=\$SCRATCH/.config<br />
export XDG_RUNTIME_DIR=\$SCRATCH/.runtime<br />
export JUPYTER_CONFIG_DIR=\$SCRATCH/.config/.jupyter<br />
jupyter \${1:-notebook} --ip \$(hostname -f) --no-browser --notebook-dir=\$PWD<br />
EOF<br />
$ chmod +x $HOME/.virtualenvs/jupyter/bin/notebook.sh<br />
</pre><br />
<br />
2. To run the jupyter server on a compute node, start an interactive session with the <tt>salloc</tt> command (debugjob would also work) and then launch the server:<br />
<br />
<pre><br />
$ salloc --time=2:00:00 -N 1 -n 40 # get one dedicated node for two hours with 40 cores.<br />
$ cd $SCRATCH # $HOME is read-only, so move to $SCRATCH<br />
$ $HOME/.virtualenvs/jupyter/bin/notebook.sh # add the argument "lab" to start with the jupyter lab<br />
</pre><br />
Make sure you note down (a) the name of the compute node that you were allocated (the salloc command will let you know; node names start with "<tt>nia</tt>" followed by a 4-digit number), and (b) the last URL that notebook.sh tells you to use to connect.<br />
<br />
3. To connect to this jupyter server running on a compute node, which is not accessible from the internet, you must, in a different terminal on your own computer, reconnect to Niagara with a port-forwarding<br />
tunnel to the compute node on which jupyter is running:<br />
<br />
<pre><br />
$ ssh -L8888:niaXXXX:8888 USERNAME@niagara.scinet.utoronto.ca -N<br />
</pre><br />
<br />
where <tt>niaXXXX</tt> is the name of the compute node (point (a) above), and <tt>USERNAME</tt> should be your Compute Canada username. This command will just "hang" there; it only serves to forward port 8888 to port 8888 on the compute node. <br />
<br />
Finally, point your browser to the URL that the <tt>notebook.sh</tt> command printed out (point (b) above), i.e., the one with 127.0.0.1 in it.<br />
<br />
= Producing Matplotlib Figures on Niagara Compute Nodes and in Job Scripts =<br />
<br />
The conventional way of producing figures from Python using matplotlib, i.e., <br />
<br />
import matplotlib.pyplot as plt<br />
plt.plot(.....)<br />
plt.savefig(...)<br />
<br />
will not work on the Niagara compute nodes. The reason is that pyplot will try to open the figure in a window on the screen, but the compute nodes do not have screens or window managers. There is an easy workaround, however, that sets up a different 'backend' to matplotlib, one that does not try to open a window, as follows:<br />
<br />
import matplotlib as mpl<br />
mpl.use('Agg')<br />
import matplotlib.pyplot as plt<br />
plt.plot(.....)<br />
plt.savefig(...)<br />
<br />
It is essential that the <tt>mpl.use('Agg')</tt> command precedes the importing of pyplot.<br />
<br />
= Using mpi4py =<br />
<br />
Several of the Python installations contain mpi4py preinstalled. However, using mpi4py requires loading an MPI module. There are several combinations of compiler/MPI/python modules which can be used.<br />
<br />
== Using Regular Python ==<br />
<br />
The Python in the regular <tt>python</tt> module (compiled from source) does not come with mpi4py. You will need to install mpi4py in your own storage space, preferably in a virtual environment.<br />
<br />
$ module load NiaEnv/2019b gcc/8.3.0 intelmpi/2019u5 python/3.6.8<br />
$ virtualenv --system-site-packages ~/.virtualenvs/mpi4pyenv<br />
$ source ~/.virtualenvs/mpi4pyenv/bin/activate<br />
(mpi4pyenv)$ pip install mpi4py<br />
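A quick, illustrative way to verify the installation (best done in an interactive debug job rather than on a login node):<br />
 (mpi4pyenv)$ mpirun -np 4 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"<br />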
<br />
== Using intelpython ==<br />
<br />
Using either the NiaEnv/2019b or NiaEnv/2018a stack (the most recent software stack is always recommended), the intelpython modules all have mpi4py, and should all work if an MPI module is also loaded. An example of this, using the NiaEnv/2019b stack, might be<br />
<br />
$ module load NiaEnv/2019b<br />
$ module load intel/2019u4 intelmpi/2019u4<br />
$ module load intelpython3/2019u4<br />
<br />
Other combinations of compilers (intel/gcc) or MPI module (intelmpi/openmpi) will also work with intelpython.<br />
<br />
== Using Anaconda ==<br />
<br />
Under the NiaEnv/2018a stack, anaconda is available as a module. This module does not come with mpi4py, but mpi4py can be installed using the usual steps: <br />
<br />
$ module load gcc/7.3.0 openmpi/3.1.1<br />
$ module load anaconda3/2018.12<br />
$<br />
$ conda create -n myenv<br />
$ <br />
$ source activate myenv<br />
(myenv) $ <br />
(myenv) $ conda install mpi4py<br />
(myenv) $<br />
<br />
<br />
== Error messages ==<br />
<br />
When using openmpi with mpi4py, you may get an error of this type:<br />
<br />
pml_ucx.c:285 Error: UCP worker does not support MPI_THREAD_MULTIPLE<br />
<br />
Add the following lines to your Python script, BEFORE you import the mpi4py package:<br />
<br />
import mpi4py.rc<br />
mpi4py.rc.threads = False<br />
<br />
Alternatively, you can edit the __init__.py file in your virtualenv's mpi4py directory (venv/lib/python3.8/site-packages/mpi4py for example), and change the 'thread_level' to 'funneled':<br />
<br />
thread_level = 'funneled'<br />
<br />
This should change the level of mpi4py's thread support.<br />
<br />
= SciNet's Python Classes =<br />
<br />
There is a dizzying amount of documentation available for programming in Python on the [http://python.org/ Python.org webpage]. That being said, each fall, SciNet runs two 4-week classes on using Python for research:<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=scmp142&include=all&filter=Filter SCMP142]: Introduction to Programming with Python. This class is intended for those with little-to-no programming experience who wish to learn how to program.<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=scmp112&include=all&filter=Filter SCMP112]: Introduction to Scientific Computing with Python. This class focusses on using Python to perform research computing.<br />
<br />
An excellent set of material for teaching scientists to program in Python is also available at the [https://v4.software-carpentry.org/python/index.html Software Carpentry homepage].</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3330
Scinet-conda
2021-11-12T16:44:30Z
<p>Ejspence: /* Activating your scinet-conda environment */</p>
<hr />
<div><br />
The scinet-conda wrapper is an EXPERIMENTAL package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create scinet-conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular Python virtual environment. Similarly, the environment can be deactivated thus<br />
<br />
(myvenv)$ scinet-conda deactivate<br />
$<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3329
Scinet-conda
2021-11-12T16:43:55Z
<p>Ejspence: /* Managing conda environments */</p>
<hr />
<div><br />
The scinet-conda wrapper is an EXPERIMENTAL package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create scinet-conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment. Similarly, the environment can be deactivated thus<br />
<br />
(myvenv)$ scinet-conda deactivate<br />
$<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3328
Scinet-conda
2021-11-12T16:23:41Z
<p>Ejspence: </p>
<hr />
<div><br />
The scinet-conda wrapper is an EXPERIMENTAL package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment. Similarly, the environment can be deactivated thus<br />
<br />
(myvenv)$ scinet-conda deactivate<br />
$<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3327
Scinet-conda
2021-11-12T16:22:44Z
<p>Ejspence: /* Activating your scinet-conda environment */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment. Similarly, the environment can be deactivated thus<br />
<br />
(myvenv)$ scinet-conda deactivate<br />
$<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3326
Scinet-conda
2021-11-12T16:21:54Z
<p>Ejspence: /* Activating scinet-conda */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3325
Scinet-conda
2021-11-12T16:21:32Z
<p>Ejspence: /* Accessing the Singularity container */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(myvenv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3324
Scinet-conda
2021-11-12T16:21:16Z
<p>Ejspence: /* Installing packages */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(myvenv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(myvenv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(myvenv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3323
Scinet-conda
2021-11-12T16:20:47Z
<p>Ejspence: /* Activating your scinet-conda environment */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate myvenv<br />
(myvenv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3322
Scinet-conda
2021-11-12T16:19:43Z
<p>Ejspence: /* Managing conda environments */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n myvenv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
myvenv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n myvenv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3321
Scinet-conda
2021-11-12T16:16:47Z
<p>Ejspence: /* Managing conda environments */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n my.venv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. <br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
my.venv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n my.venv<br />
<br />
== Activating your scinet-conda environment ==<br />
<br />
To activate a scinet-conda environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3320
Scinet-conda
2021-11-12T16:15:04Z
<p>Ejspence: /* Activating the environment */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating scinet-conda ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n my.venv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. To activate this environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
my.venv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n my.venv<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3319
Scinet-conda
2021-11-12T16:14:03Z
<p>Ejspence: /* Creating conda environments */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating the environment ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Managing conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n my.venv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. To activate this environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
To list the existing scinet-conda environments use the command<br />
<br />
$ scinet-conda env list<br />
my.venv<br />
$<br />
<br />
To delete an existing scinet-conda environment you can use the command<br />
<br />
$ scinet-conda env remove -n my.venv<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3318
Scinet-conda
2021-11-12T16:11:47Z
<p>Ejspence: /* Installing conda packages */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating the environment ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Creating conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n my.venv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. To activate this environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl<br />
<br />
<br />
If you need to install a package that is not available through conda channels you can always use pip<br />
<br />
(my.venv)$ scinet-conda pip install nose<br />
<br />
== Accessing the Singularity container ==<br />
<br />
For some packages you need access to the internal environment of the Singularity container itself. This is accessed using the 'shell' command<br />
<br />
(my.venv)$ scinet-conda shell</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3317
Scinet-conda
2021-11-12T16:09:03Z
<p>Ejspence: /* How to use scinet-cond */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-conda =<br />
<br />
== Activating the environment ==<br />
<br />
By default scinet-conda is not available. It can be put into your environment using the command<br />
<br />
$ source /scinet/conda/etc/profile.d/scinet-conda.sh<br />
<br />
This command must be invoked at the beginning of each session to make the necessary commands available.<br />
<br />
== Creating conda environments ==<br />
<br />
You can now create conda environments using a similar syntax to regular Anaconda:<br />
<br />
$ scinet-conda create -n my.venv python==3.7<br />
<br />
Note that you must specify the Python version that you want. This will create a Singularity container which contains Anaconda, in your $HOME directory. To activate this environment, use the command:<br />
<br />
$ scinet-conda activate my.venv<br />
(my.venv)$<br />
<br />
This environment can now be used in the same way as a regular virtual environment.<br />
<br />
== Installing conda packages ==<br />
<br />
Once the environment has been activated conda packages can be installed in your scinet-conda environment in the same manner as a regular conda environment.<br />
<br />
(my.venv)$ scinet-conda install six<br />
<br />
To install a package from a specific channel, specify the channel in the usual way.<br />
<br />
(my.venv)$ scinet-conda install --channel dglteam dgl</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3316
Scinet-conda
2021-11-12T16:02:25Z
<p>Ejspence: /* How to use scinet-cond */</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-cond =<br />
<br />
By default scinet-conda is not<br />
<br />
$ ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Scinet-conda&diff=3315
Scinet-conda
2021-11-12T16:01:02Z
<p>Ejspence: Created page with " The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creat..."</p>
<hr />
<div><br />
The scinet-conda wrapper is a package created to address Anaconda's feature of creating zillions of files upon installation, which is not good for HPC file systems. It creates a virtual environment-like interface to Anaconda installed in a Singularity container, thus eliminating the problem of having numerous small files.<br />
<br />
= How to use scinet-cond =<br />
<br />
By default scinet-conda is not</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Niagara_Quickstart&diff=3229
Niagara Quickstart
2021-09-27T18:59:00Z
<p>Ejspence: /* Submitting jobs */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Niagara.jpg|center|300px|thumb]]<br />
|name=Niagara<br />
|installed=Jan 2018/March 2020<br />
|operatingsystem= CentOS 7.6 <br />
|loginnode= niagara.scinet.utoronto.ca<br />
|nnodes= 2,024 nodes (80,960 cores)<br />
|rampernode=188 GiB / 202 GB <br />
|corespernode=40 (80 hyperthreads)<br />
|interconnect=Mellanox Dragonfly+<br />
|vendorcompilers= icc (C) ifort (fortran) icpc (C++)<br />
|queuetype=Slurm<br />
}}<br />
<br />
=Specifications=<br />
<br />
The Niagara cluster is a large cluster of 2,024 Lenovo SD530 servers, each with 40 Intel "Skylake" cores at 2.4 GHz or 40 Intel "CascadeLake" cores at 2.5 GHz. <br />
The peak performance of the cluster is about 3.6 PFlops (6.25 PFlops theoretical). It was the 53rd fastest supercomputer on the [https://www.top500.org/list/2018/06/?page=1 TOP500 list of June 2018], and is at number 113 on the [https://www.top500.org/lists/top500/list/2021/06/ current list (June 2021)]. <br />
<br />
Each node of the cluster has 188 GiB / 202 GB of RAM (at least 4 GiB/core for user jobs). Being designed for large parallel workloads, the cluster has a fast interconnect consisting of EDR InfiniBand in a Dragonfly+ topology with Adaptive Routing. The compute nodes are accessed through a queueing system that allows jobs with a minimum of 15 minutes and a maximum of 24 hours, and favours large jobs.<br />
<br />
* See the [https://www.youtube.com/watch?v=l-E2CFGh0BE&feature=youtu.be "Intro to Niagara"] recording<br />
<br />
More detailed hardware characteristics of the Niagara supercomputer can be found [https://docs.computecanada.ca/wiki/Niagara on this page].<br />
<br />
Note: Documentation about the "GPU expansion to Niagara" called "Mist" can be found on [[Mist | its own page]].<br />
<br />
= Getting started on Niagara =<br />
<br />
Access to Niagara is not enabled automatically for everyone with a Compute Canada account, but anyone with an active Compute Canada account can get their access enabled.<br />
<br />
If you have an active Compute Canada account but you do not have access to Niagara yet (e.g. because you are new to SciNet or belong to a group whose primary PI does not have an allocation as granted in the annual [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions Compute Canada RAC]), go to the [https://ccdb.computecanada.ca/services/opt_in opt-in page on the CCDB site]. After clicking the "Join" button, it usually takes only one or two business days for access to be granted. <br />
<br />
Please read this document carefully. The [https://docs.scinet.utoronto.ca/index.php/FAQ FAQ] is also a useful resource. If at any time you require assistance, or if something is unclear, please do not hesitate to [mailto:support@scinet.utoronto.ca contact us].<br />
<br />
== Logging in ==<br />
<br />
Niagara runs CentOS 7, which is a type of Linux. You will need to be familiar with Linux systems to work on Niagara. If you are not, it will be worth your time to review our [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=scmp101&include=all&filter=Filter Introduction to Linux Shell] class.<br />
<br />
As with all SciNet and CC (Compute Canada) compute systems, access to Niagara is done via [[SSH]] (secure shell) only. Open a terminal window (e.g. using [https://docs.computecanada.ca/wiki/Connecting_with_PuTTY PuTTY] on Windows or [https://docs.computecanada.ca/wiki/Connecting_with_MobaXTerm MobaXTerm]), then SSH into the Niagara login nodes with your CC credentials:<br />
<br />
$ ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
<br />
or<br />
<br />
$ ssh -Y MYCCUSERNAME@niagara.computecanada.ca<br />
<br />
The first time you login to Niagara, please make sure to check if the login node ssh host key fingerprint<br />
matches. [[SSH_Changes_in_May_2019 | See here how]].<br />
<br />
* The Niagara login nodes are where you develop, edit, compile, prepare and submit jobs.<br />
* These login nodes are not part of the Niagara compute cluster, but have the same architecture, operating system, and software stack.<br />
* The optional <code>-Y</code> is needed to open windows from the Niagara command-line onto your local X server.<br />
* You can only connect 4 times in a 2-minute window to the login nodes. <br />
* To run on Niagara's compute nodes, you must [[#Submitting_jobs | submit a batch job]].<br />
<br />
If you cannot log in, be sure to first check the [https://docs.scinet.utoronto.ca System Status] on this site's front page.<br />
<br />
== Your various directories ==<br />
<br />
By virtue of your access to Niagara you are granted storage space on the system. There are several directories available to you, each indicated by an associated environment variable.<br />
<br />
=== home and scratch ===<br />
<br />
You have a home and scratch directory on the system, the paths to which are stored in the environment variables $HOME and $SCRATCH. The locations are of the form<br />
<br />
$HOME=/home/g/groupname/myccusername<br />
$SCRATCH=/scratch/g/groupname/myccusername<br />
<br />
where groupname is the name of your PI's group, and myccusername is your CC username. For example:<br />
<br />
nia-login07:~$ pwd<br />
/home/s/scinet/rzon<br />
nia-login07:~$ cd $SCRATCH<br />
nia-login07:rzon$ pwd<br />
/scratch/s/scinet/rzon<br />
<br />
NOTE: home is read-only on compute nodes.<br />
<br />
=== project and archive/nearline ===<br />
<br />
Users from groups with [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions RAC storage allocation] will also have a project directory, and possibly an archive (a.k.a. "nearline") directory, the paths to which are stored in the environment variables $PROJECT and $ARCHIVE. They follow the naming convention:<br />
<br />
$PROJECT=/project/g/groupname/myccusername<br />
$ARCHIVE=/archive/g/groupname/myccusername<br />
<br />
NOTE: Currently archive space is available only via [[HPSS]], and is not accessible on the Niagara login, compute, or datamover nodes.<br />
<br />
'''''IMPORTANT: Future-proof your scripts'''''<br />
<br />
When writing your scripts, use the environment variables (<tt>$HOME</tt>, <tt>$SCRATCH</tt>, <tt>$PROJECT</tt>, <tt>$ARCHIVE</tt>) instead of the actual paths! The paths may change in the future.<br />
<br />
=== Storage and quotas ===<br />
<br />
You should familiarize yourself with the [[Data_Management#Purpose_of_each_file_system | various file systems]], what purpose they serve, and how to properly use them. This table summarizes the various file systems. See the [[Data_Management | Data Management]] page for more details.<br />
<br />
{| class="wikitable"<br />
! location<br />
!colspan="2"| quota<br />
!align="right"| block size<br />
! expiration time<br />
! backed up<br />
! on login nodes<br />
! on compute nodes<br />
|-<br />
| $HOME<br />
|colspan="2"| 100 GB / 250,000 files per user<br />
|align="right"| 1 MB<br />
| <br />
| yes<br />
| yes<br />
| read-only<br />
|-<br />
|rowspan="2"| $SCRATCH<br />
|colspan="2"| 25 TB / 6,000,000 file per user<br />
|align="right" rowspan="2" | 16 MB<br />
|rowspan="2"| 2 months<br />
|rowspan="2"| no<br />
|rowspan="2"| yes<br />
|rowspan="2"| yes<br />
|-<br />
|align="right"|50-500TB per group<br />
|align="right"|[[Data_Management#Quotas_and_purging | depending on group size]]<br />
|-<br />
| $PROJECT<br />
|colspan="2"| by group allocation<br />
|align="right"| 16 MB<br />
| <br />
| yes<br />
| yes<br />
| yes<br />
|-<br />
| $ARCHIVE<br />
|colspan="2"| by group (nearline) allocation<br />
|align="right"| <br />
|<br />
| dual-copy<br />
| no<br />
| no<br />
|-<br />
| $BBUFFER<br />
|colspan="2"| 10 TB per user<br />
|align="right"| 1 MB<br />
| very short<br />
| no<br />
| yes<br />
| yes<br />
|}<br />
<br />
=== Moving data to Niagara ===<br />
<br />
If you need to move data to Niagara for analysis, or when you need to move data off of Niagara, use the following guidelines:<br />
* If your data is less than 10GB, move the data using the login nodes.<br />
* If your data is greater than 10GB, move the data using the datamover nodes nia-datamover1.scinet.utoronto.ca and nia-datamover2.scinet.utoronto.ca.<br />
<br />
Details of how to use the datamover nodes can be found on the [[Data_Management#Moving_data | Data Management ]] page.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Niagara: use existing software, or [[Niagara_Quickstart#Compiling_on_Niagara:_Example | compile your own]]. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
== Software stacks: NiaEnv and CCEnv ==<br />
<br />
On Niagara, there are two available software stacks:<br />
<br />
=== NiaEnv ===<br />
<br />
A [https://docs.scinet.utoronto.ca/index.php/Modules_specific_to_Niagara Niagara software stack] tuned and compiled for this machine. This stack is available by default, but if not, can be reloaded with<br />
<pre>module load NiaEnv</pre><br />
This loads the default set of modules, which is currently the 2019b epoch. Before September 1, the default was NiaEnv/2018a. Users are encouraged to use the 2019b stack, but to make sure old job scripts or older software installations in your home directory continue to work, you may need to use<br />
<pre>module load NiaEnv/2018a</pre><br />
You can override the system default for the epoch version by creating a file called <b><tt>.modulerc</tt></b> in your home directory with the line <b><tt>module-version NiaEnv/VERSION default</tt></b>, e.g. like so:<br />
<pre><br />
echo "module-version NiaEnv/2019b default" > $HOME/.modulerc<br />
</pre><br />
After this, subsequent logins and jobs will use the 2019b stack even when the system default is different.<br />
<br />
Similarly, you can make an older epoch your personal default, like so<br />
<pre><br />
echo "module-version NiaEnv/2018a default" > $HOME/.modulerc<br />
</pre><br />
<br />
No modules are loaded by default on Niagara except NiaEnv.<br />
<br />
=== CCEnv ===<br />
<br />
The same [https://docs.computecanada.ca/wiki/Modules software stack available on Compute Canada's General Purpose clusters] [https://docs.computecanada.ca/wiki/Graham Graham] and [https://docs.computecanada.ca/wiki/Cedar Cedar] can be used on Niagara too, with:<br />
<pre>module load CCEnv</pre><br />
Or, if you want the same default modules loaded as on Béluga, then do<br />
<pre>module load CCEnv StdEnv</pre><br />
or, if you want the same default modules loaded as on Cedar and Graham, do<br />
<pre>module load CCEnv arch/avx2 StdEnv</pre><br />
<br />
== Tips for loading software ==<br />
<br />
* We advise '''''against''''' loading modules in your .bashrc. This can lead to very confusing behaviour under certain circumstances. Our guidelines for .bashrc files can be found [[bashrc guidelines|here]].<br />
* Instead, load modules by hand when needed, or by sourcing a separate script.<br />
* Load run-specific modules inside your job submission script.<br />
* Short names give default versions; e.g. <code>intel</code> → <code>intel/2018.2</code>. It is usually better to be explicit about the versions, for future reproducibility.<br />
* Modules often require other modules to be loaded first. Solve these dependencies by using [[Using_modules#Module_spider | <code>module spider</code>]].<br />
<br />
= Available compilers and interpreters =<br />
<br />
* For most compiled software, one should use the Intel compilers (<tt>icc</tt> for C, <tt>icpc</tt> for C++, and <tt>ifort</tt> for Fortran). Loading an <tt>intel</tt> module makes these available. <br />
* The GNU compiler suite (<tt>gcc, g++, gfortran</tt>) is also available, if you load one of the <tt>gcc</tt> modules.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> or <tt>intelmpi</tt> module.<br />
* Open source interpreted, interactive software is also available:<br />
** [[Python]]<br />
** [[R]]<br />
** Julia<br />
** [[Octave]]<br />
<br />
Please visit the corresponding page for details on using these tools. For information on running MATLAB applications on Niagara, visit [[MATLAB| this page]].<br />
<br />
= Using Commercial Software =<br />
<br />
May I use commercial software on Niagara?<br />
* Possibly, but you have to bring your own license for it. You can connect to an external license server using [[SSH_Tunneling | ssh tunneling]].<br />
* SciNet and Compute Canada have an extremely large and broad user base of thousands of users, so we cannot provide licenses for everyone's favorite software.<br />
* Thus, the only freely available commercial software installed on Niagara is software that can benefit everyone: Compilers, math libraries and debuggers.<br />
* That means no [[MATLAB]], Gaussian, IDL, etc.<br />
* Open source alternatives like Octave, [[Python]], and [[R]] are available.<br />
* We are happy to help you to install commercial software for which you have a license.<br />
* In some cases, if you have a license, you can use software in the Compute Canada stack.<br />
The list of commercial software which is installed on Niagara, for which you will need a license to use, can be found on the [[Commercial_software | commercial software page]].<br />
<br />
= Compiling on Niagara: Example =<br />
<br />
Suppose one wants to compile an application from two C source files, appl.c and module.c, which use the Math Kernel Library. This is an example of how this would be done:<br />
<source lang="bash"><br />
nia-login07:~$ module load NiaEnv/2019b<br />
nia-login07:~$ module list<br />
Currently Loaded Modules:<br />
1) NiaEnv/2019b (S)<br />
Where:<br />
S: Module is Sticky, requires --force to unload or purge<br />
<br />
nia-login07:~$ module load intel/2019u4<br />
<br />
nia-login07:~$ ls<br />
appl.c module.c<br />
<br />
nia-login07:~$ icc -c -O3 -xHost -o appl.o appl.c<br />
nia-login07:~$ icc -c -O3 -xHost -o module.o module.c<br />
nia-login07:~$ icc -o appl module.o appl.o -mkl<br />
<br />
nia-login07:~$ ./appl<br />
</source><br />
Note:<br />
* The optimization flags -O3 -xHost allow the Intel compiler to use instructions specific to the CPU architecture that is present (instead of for more generic x86_64 CPUs).<br />
* Linking with the Intel Math Kernel Library (MKL) is easy when using the Intel compiler; it just requires the -mkl flag.<br />
* If compiling with gcc, the optimization flags would be -O3 -march=native. For the way to link with the MKL, it is suggested to use the [https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor MKL link line advisor].<br />
<br />
= Testing and Debugging =<br />
<br />
You really should test your code before you submit it to the cluster to know if your code is correct and what kind of resources you need.<br />
* Small test jobs can be run on the login nodes. Rule of thumb: tests should run no more than a couple of minutes, taking at most about 1-2GB of memory, and use no more than a couple of cores.<br />
* You can run the [[Parallel Debugging with DDT|DDT]] debugger on the login nodes after <code>module load ddt</code>.<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob command:<br />
nia-login07:~$ debugjob --clean N<br />
where N is the number of nodes. If N=1, this gives an interactive session of 1 hour; if N=4 (the maximum), it gives you 22 minutes. The <tt>--clean</tt> argument is optional but recommended as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
Finally, if your debugjob process takes more than 1 hour, you can request an interactive job from the regular queue using the salloc command. Note, however, that this may take some time to run, since it will be part of the regular queue, and will be run when the scheduler decides.<br />
nia-login07:~$ salloc --nodes N --time=M:00:00 --x11<br />
where N is again the number of nodes, and M is the number of hours you wish the job to run.<br />
The <tt>--x11</tt> is required if you need to use graphics while testing your code through salloc, e.g. when using a debugger such as [[Parallel Debugging with DDT|DDT]] or DDD. See the [[Testing_With_Graphics | Testing with graphics]] page for the options in that case.<br />
<br />
= Submitting jobs =<br />
<br />
<!-- == Progressive approach to run jobs on niagara == --><br />
<!-- We would like to emphasize the need for users to adopt a more progressive and explicit approach for testing, running and scaling up of jobs on niagara. [[Progressive_Approach | '''Here is a set of steps we suggest that you follow.''']] --><br />
<br />
Once you have compiled and tested your code or workflow on the Niagara login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on some of Niagara's 1548 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Niagara uses SLURM as its job scheduler. More advanced details of how to interact with the scheduler can be found on the [[Slurm | Slurm page]].<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
nia-login07:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. Note that you must submit your job from a login node. You cannot submit jobs from the datamover nodes.<br />
<br />
In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Jobs will run under your group's RRG allocation, or, if your group has none, under a RAS allocation (previously called a `default' allocation).<br />
<br />
Some example job scripts can be found below.<br />
<br />
Keep in mind:<br />
* Scheduling is by node, so in multiples of 40 cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain the "module load" commands for all the required modules (see examples below). <br />
* [[Data_Management#Moving_data | Move your data]] to Niagara before you submit your job.<br />
<br />
== Scheduling by Node ==<br />
<br />
On many systems that use SLURM, the scheduler will deduce from the specifications of the number of tasks and the number of cpus-per-task what resources should be allocated. On Niagara, things are a bit different.<br />
* All job resource requests on Niagara are scheduled as a multiple of '''nodes'''.<br />
* The nodes that your jobs run on are exclusively yours, for as long as the job is running on them.<br />
** No other users are running anything on them.<br />
** You can [[SSH]] into them to see how things are going.<br />
* Whatever your request to the scheduler, it will always be translated into a multiple of nodes allocated to your job.<br />
* Memory requests to the scheduler are of no use. Your job always gets N x 202GB of RAM, where N is the number of nodes and 202GB is the amount of memory on the node.<br />
* If you run serial jobs you must still use all 40 cores on the node. Visit the [[Running_Serial_Jobs_on_Niagara | serial jobs]] page for examples of how to do this; a minimal sketch follows this list.<br />
* Since there are 40 cores per node, your job should use N x 40 cores. If it does not, we will contact you to help you optimize your workflow, or you can [mailto:support@scinet.utoronto.ca contact us] to get assistance.<br />
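<br />
To make the last two bullets concrete, the job script below sketches one way to keep all 40 cores of a node busy with independent serial tasks. It is only an outline under assumptions: the module name gnu-parallel and the program and input names are placeholders, so check <tt>module spider</tt> for the actual module and see the [[Running_Serial_Jobs_on_Niagara | serial jobs]] page for the recommended recipes.<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=serial_bundle<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load NiaEnv/2019b<br />
module load gnu-parallel   # module name assumed; check "module spider gnu-parallel"<br />
<br />
# Run one serial task per core, at most 40 at a time, over all input files.<br />
# "./serial_task" and "input*.dat" are placeholders for your own program and inputs.<br />
parallel -j 40 ./serial_task {} ::: input*.dat<br />
</source><br />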
<br />
== Limits ==<br />
<br />
There are limits to the size and duration of your jobs, the number of jobs you can run and the number of jobs you can have queued. It matters whether a user is part of a group with a [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ Resources for Research Group allocation] or not. It also matters in which 'partition' the job runs. 'Partitions' are SLURM-speak for use cases. You specify the partition with the <tt>-p</tt> parameter to <tt>sbatch</tt> or <tt>salloc</tt>, but if you do not specify one, your job will run in the <tt>compute</tt> partition, which is the most common case. <br />
<br />
{| class="wikitable"<br />
!Usage<br />
!Partition<br />
!Limit on Running jobs<br />
!Limit on Submitted jobs (incl. running)<br />
!Min. size of jobs<br />
!Max. size of jobs<br />
!Min. walltime<br />
!Max. walltime <br />
|-<br />
|Compute jobs ||compute || 50 || 1000 || 1 node (40&nbsp;cores) || default:&nbsp;20&nbsp;nodes&nbsp;(800&nbsp;cores) <br> with&nbsp;allocation:&nbsp;1000&nbsp;nodes&nbsp;(40000&nbsp;cores)|| 15 minutes || 24 hours<br />
|-<br />
|Testing or troubleshooting || debug || 1 || 1 || 1 node (40&nbsp;cores) || 4 nodes (160 cores)|| N/A || 1 hour<br />
|-<br />
|Archiving or retrieving data in [[HPSS]]|| archivelong || 2 per user (5 in total) || 10 per user || N/A || N/A|| 15 minutes || 72 hours<br />
|-<br />
|Inspecting archived data, small archival actions in [[HPSS]] || archiveshort vfsshort || 2 per user|| 10 per user || N/A || N/A || 15 minutes || 1 hour<br />
|}<br />
<br />
Even if you respect these limits, your jobs will still have to wait in the queue. The waiting time depends on many factors such as your group's allocation amount, how much allocation has been used in the recent past, the number of requested nodes and walltime, and how many other jobs are waiting in the queue.<br />
<br />
== File Input/Output Tips ==<br />
<br />
It is important to understand the file systems, so as to perform your file I/O (Input/Output) responsibly. Refer to the [[Data_Management | Data Management]] page for details about the file systems.<br />
* Your files can be seen on all Niagara login and compute nodes.<br />
* $HOME, $SCRATCH, and $PROJECT all use the parallel file system called GPFS.<br />
* GPFS is a high-performance file system which provides rapid reads and writes to large data sets in parallel from many nodes.<br />
* Accessing data sets which consist of many small files leads to poor performance on GPFS.<br />
* Avoid reading and writing lots of small amounts of data to disk. Many small files on the system waste space and are slower to access, read and write. If you must write many small files, use [[User_Ramdisk | ramdisk]]; a minimal sketch is given after this list.<br />
* Write data out in a binary format. This is faster and takes less space.<br />
* The [[Burst Buffer]] is another option for I/O heavy-jobs and for speeding up [[Checkpoints|checkpoints]].<br />
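<br />
As an illustration of the ramdisk suggestion above, the fragment below sketches one possible pattern inside a job script: stage a many-small-file working set into memory, work there, and bundle the results back to $SCRATCH as a single archive. The mount point /dev/shm and the archive names are assumptions for illustration; see the [[User_Ramdisk | ramdisk]] page for the authoritative instructions.<br />
<source lang="bash"># Sketch only: assumes the ramdisk is available at /dev/shm on the compute node.<br />
RAMDIR=/dev/shm/$USER<br />
mkdir -p $RAMDIR<br />
<br />
# Stage the small-file input set (placeholder archive name) from scratch into memory.<br />
tar -xf $SCRATCH/small_inputs.tar -C $RAMDIR<br />
<br />
# ... run your application against the files in $RAMDIR here ...<br />
<br />
# Bundle the many small output files into a single archive back on scratch.<br />
tar -cf $SCRATCH/small_outputs.tar -C $RAMDIR .<br />
<br />
# Clean up, since files in the ramdisk count against the node's memory.<br />
rm -rf $RAMDIR<br />
</source><br />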
<br />
== Example submission script (MPI) ==<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=2<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load NiaEnv/2019b<br />
module load intel/2019u4<br />
module load openmpi/4.0.1<br />
<br />
mpirun ./mpi_example<br />
# or "srun ./mpi_example"<br />
</source><br />
Submit this script from your scratch directory with the command:<br />
<br />
nia-login07:scratch$ sbatch mpi_job.sh<br />
<br />
<ul><br />
<li>First line indicates that this is a bash script.</li><br />
<li>Lines starting with <code>#SBATCH</code> go to SLURM.</li><br />
<li>sbatch reads these lines as a job request (which it gives the name <code>mpi_job</code>)</li><br />
<li>In this case, SLURM looks for 2 nodes each running 40 tasks (for a total of 80 tasks), for 1 hour</li><br />
<li>Note that the mpirun flag "--ppn" (processors per node) is ignored.</li><br />
<li>Once it has found such nodes, it runs the script:<br />
<ul><br />
<li>Changes to the submission directory;</li><br />
<li>Loads modules;</li><br />
<li>Runs the <code>mpi_example</code> application (SLURM will inform mpirun or srun on how many processes to run).<br />
</li><br />
</ul><br />
<li>To use hyperthreading, just change <code>--ntasks-per-node=40</code> to <code>--ntasks-per-node=80</code>, and add <code>--bind-to none</code> to the mpirun command (the latter is necessary for OpenMPI only, not when using IntelMPI); a sketch of the changed lines follows this list.</li><br />
</ul><br />
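<br />
To make the hyperthreading variant concrete, these are the only lines of the example script above that would change (OpenMPI is assumed, as in the example):<br />
<source lang="bash"># Changed lines only, relative to the MPI example above (OpenMPI assumed):<br />
#SBATCH --ntasks-per-node=80<br />
<br />
mpirun --bind-to none ./mpi_example<br />
</source><br />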
<br />
== Example submission script (OpenMP) ==<br />
<br />
<source lang="bash">#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --cpus-per-task=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=openmp_job<br />
#SBATCH --output=openmp_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load NiaEnv/2019b<br />
module load intel/2019u4<br />
<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
./openmp_example<br />
# or "srun ./openmp_example".<br />
</source><br />
Submit this script from your scratch directory with the command:<br />
<br />
nia-login07:~$ sbatch openmp_job.sh<br />
<br />
* First line indicates that this is a bash script.<br />
* Lines starting with <code>#SBATCH</code> go to SLURM.<br />
* sbatch reads these lines as a job request (which it gives the name <code>openmp_job</code>).<br />
* In this case, SLURM looks for one node with 40 cores to be run inside one task, for 1 hour.<br />
* Once it has found such a node, it runs the script:<br />
** Changes to the submission directory;<br />
** Loads modules;<br />
** Sets an environment variable;<br />
** Runs the <code>openmp_example</code> application.<br />
* To use hyperthreading, just change <code>--cpus-per-task=40</code> to <code>--cpus-per-task=80</code>.<br />
<br />
== Monitoring queued jobs ==<br />
<br />
Once the job is incorporated into the queue, there are some commands you can use to monitor its progress.<br />
<br />
<ul><br />
<li><p><code>squeue</code> or <code>sqc</code> (a caching version of squeue) to show the job queue (<code>squeue -u $USER</code> for just your jobs);</p></li><br />
<li><p><code>squeue -j JOBID</code> to get information on a specific job</p><br />
<p>(alternatively, <code>scontrol show job JOBID</code>, which is more verbose).</p></li><br />
<li><p><code>squeue --start -j JOBID</code> to get an estimate for when a job will run; these tend not to be very accurate predictions.</p></li><br />
<li><p><code>scancel -i JOBID</code> to cancel the job.</p></li><br />
<li><p><code>jobperf JOBID</code> to get an instantaneous view of the cpu and memory usage of the nodes of the job while it is running.</p></li><br />
<li><p><code>sacct</code> to get information on your recent jobs.</p></li><br />
</ul><br />
<br />
Further instructions for monitoring your jobs can be found on the [[Slurm#Monitoring_jobs | Slurm page]]. The [https://my.scinet.utoronto.ca my.SciNet] site is also a very useful tool for monitoring your current and past usage.<br />
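<br />
For example, a compact overview of just your own jobs can be obtained by combining the commands above with standard squeue format options (the format string below is merely one possible choice):<br />
<source lang="bash"># Show only your jobs: job id, partition, name, state, elapsed time, and node count.<br />
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.10M %.6D"<br />
<br />
# Once a job is running, check the cpu and memory usage of its nodes:<br />
jobperf JOBID<br />
</source><br />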
<br />
= Visualization =<br />
Information about how to use visualization tools on Niagara is available on the [[Visualization]] page.<br />
<br />
= Support =<br />
<br />
* [mailto:support@scinet.utoronto.ca support@scinet.utoronto.ca]<br />
* [mailto:niagara@computecanada.ca niagara@computecanada.ca]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3063
Main Page
2021-06-09T11:21:54Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down|Niagara|Niagara_Quickstart}}<br />
|{{Down |Mist|Mist}}<br />
|{{Down |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{Down |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>June 9th to 10th, 2021:</b> The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, Rouge, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown starting at 7AM EDT on Wednesday June 9th. We expect the system to be back up in the morning of Friday June 11th. Check here for updates.<br />
<br />
<b>Jun 5, 2021, 3:10 PM EDT:</b> File issues resolved.<br />
<br />
<b>Jun 5, 2021, 11:15 AM EDT:</b> File systems issues. We are investigating.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3023
Main Page
2021-05-27T01:30:28Z
<p>Ejspence: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down|Niagara|Niagara_Quickstart}}<br />
|{{Down|Mist|Mist}}<br />
|{{Down|Teach|Teach}}<br />
|{{Down|Rouge|Rouge}}<br />
|-<br />
|{{Down|Jupyter Hub|Jupyter_Hub}}<br />
|{{Down|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down|HPSS|HPSS}}<br />
|{{Down|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
May 26th, 2021, 9:20pm: We are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
<!-- Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. --><br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Parallel_Debugging_with_DDT&diff=2975
Parallel Debugging with DDT
2021-04-07T13:17:48Z
<p>Ejspence: /* Setting up a client-server connection */</p>
<hr />
<div>==ARM DDT Parallel Debugger==<br />
<br />
For parallel debugging, SciNet has DDT ("Distributed Debugging Tool") installed on all our clusters. DDT is a powerful, GUI-based commercial debugger by ARM (formerly by Allinea). It supports the programming languages C, C++, and Fortran, and the parallel programming paradigms MPI, OpenMP, and CUDA. DDT can also be very useful for serial programs. DDT provides a nice, intuitive graphical user interface. It does need graphics support, so make sure to use the '-X' or '-Y' arguments to your ssh commands, so that X11 graphics can find its way back to your screen ("X forwarding").<br />
<br />
The most recently installed version of DDT on [[Niagara_Quickstart | Niagara]] is DDT 20.1.3. The DDT license allows up to a total of 64 processes to be debugged simultaneously (shared among all users).<br />
<br />
To use DDT, ssh in with X forwarding enabled, load your usual compiler and MPI modules, compile your code with '-g', and load the module<br />
<br />
<code>module load ddt</code><br />
<br />
You can then start ddt with one of the following commands:<br />
<br />
<code>ddt</code><br />
<br />
<code>ddt <executable compiled with -g flag> </code><br />
<br />
<code>ddt <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>ddt -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
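<br />
Putting these steps together, a typical session might look like the sketch below. The module versions, source file name, and process count are placeholders for illustration only; use the compiler and MPI modules your code actually needs, and turning off optimization (-O0) while debugging is a common, optional choice.<br />
<source lang="bash"># Log in with X forwarding so the DDT GUI can reach your screen.<br />
ssh -Y USERNAME@niagara.scinet.utoronto.ca<br />
<br />
# Load your usual compiler and MPI modules (versions here are placeholders).<br />
module load intel/2019u4 openmpi/4.0.1<br />
<br />
# Compile with debugging symbols.<br />
mpicc -g -O0 -o mpi_example mpi_example.c<br />
<br />
# Load DDT and start it on, say, 4 MPI processes.<br />
module load ddt<br />
ddt -n 4 ./mpi_example<br />
</source><br />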
<br />
The first time you run DDT, it will set up configuration files. It puts these in the hidden directory $SCRATCH/.allinea. <br />
<br />
Note that most users will debug on the login nodes of the clusters (nia-login0{1-3,5-7}), but this is only appropriate if the number of MPI processes and threads is small, and the memory usage is not too large. If your debugging requires more resources, you should run it through the queue. On Niagara, an interactive debug session will suit most debugging purposes.<br />
<br />
==ARM MAP Parallel Profiler==<br />
<br />
MAP is a parallel (MPI) performance analyser with a graphical interface. It is part of the same DDT module, so you need to load <tt>ddt</tt> to use MAP (together, DDT and MAP form the <i>ARM Forge</i> bundle).<br />
<br />
It has a similar job startup interface as DDT. <br />
<br />
To be more precise, MAP is a sampling profiler with adaptive sampling rates to keep the data volumes collected under control. Samples are aggregated at all levels to preserve key features of a run without drowning in data. A folding code and stack viewer allows you to zoom into the time spent on individual lines and pull back to see the big picture across nests of routines. MAP measures memory usage, floating-point calculations, MPI usage, as well as I/O.<br />
<br />
The maximum number of MPI processes that our MAP license supports is 64 (shared simultaneously among all users).<br />
<br />
It supports both interactive and batch modes for gathering profile data.<br />
<br />
===Interactive profiling with MAP===<br />
<br />
Startup is much the same as for DDT:<br />
<br />
<code>map</code><br />
<br />
<code>map <executable compiled with -g flag> </code><br />
<br />
<code>map <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>map -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
After you have started the code and it has run to completion, MAP will show the results. It will also save these results in a file with the extension <tt>.map</tt>. This allows you to load the result again into the graphical user interface at a later time.<br />
<br />
===Non-interactive profiling with MAP===<br />
<br />
It is also possible to run map non-interactively by passing the <tt>-profile</tt> flag, e.g.<br />
<br />
<code>map -profile -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
For instance, this could be used in a job when it is launched with a jobscript like<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
module load ddt<br />
<br />
map -profile -n $SLURM_NTASKS ./mpi_example<br />
</source><br />
<br />
This will just create the <tt>.map</tt> file, which you could inspect after the job has finished with<br />
<br />
<code>map MAPFILE</code><br />
<br />
==Parallel Debugging and Profiling in an Interactive Session on Niagara==<br />
<br />
By requesting a job from the 'debug' partition on Niagara, you can have access to at most 4 nodes, i.e., a total of 160 physical cores (or 320 virtual cores, using hyper-threading), for your exclusive, interactive use. Starting from a Niagara login node, you would request a debug session with the following command:<br />
<br />
<code>debugjob <numberofnodes></code><br />
<br />
where <tt><numberofnodes></tt> is 1, 2, 3, or 4. The sessions will last 60, 45, 30, or 15 minutes, depending on the number of nodes requested.<br />
<br />
This command will get you a prompt on a compute node (or on the 'head' node if you've asked for more than one node). Reload any modules that your application needs (e.g. <tt>module load intel openmpi</tt>), as well as the <tt>ddt</tt> module.<br />
<br />
Note that on compute nodes, $HOME is read-only, so unless your code is on $SCRATCH, you cannot recompile it (with '-g') in the debug session; this should have been done on a login node.<br />
<br />
If the time restrictions of these debugjobs are too great, you need to request nodes from the regular queue. In that case, you want to make sure that you get [[Testing_With_Graphics|X11 graphics forwarded properly]].<br />
<br />
Within this debugjob session, you can then use the <tt>ddt</tt> and <tt>map</tt> commands.<br />
<br />
==Setting up a client-server connection==<br />
<br />
If you're working from home, or any other location where there isn't a fast internet connection, it is likely to be advantageous to run DDT or MAP in client-server mode. This keeps the bulk of the computation on Niagara or Mist (the server), while sending only the minimum amount of information over the internet to your locally-running version of DDT (the client).<br />
<br />
===Setting up the server side===<br />
<br />
The first step is to connect to Niagara (or Mist), and start a debug session<br />
<br />
ejspence@nia-login01 $ debugjob -N 1<br />
debugjob: Requesting 1 node(s) with 40 core(s) for 60 minutes and 0 seconds<br />
SALLOC: Granted job allocation 3995470<br />
SALLOC: Waiting for resource configuration<br />
SALLOC: Nodes nia0003 are ready for job<br />
ejspence@nia0003 $<br />
<br />
This will start an interactive debug session, on a single node, for an hour. Be sure to note the node which you have been allocated (nia0003 in this case).<br />
<br />
The next step is to determine the path to DDT. To do this you will need to load the DDT module:<br />
<br />
ejspence@nia0003 $ module load NiaEnv/2019b<br />
ejspence@nia0003 $ module load ddt/19.1<br />
ejspence@nia0003 $<br />
ejspence@nia0003 $ echo $SCINET_DDT_ROOT<br />
/scinet/niagara/software/2019b/opt/base/ddt/19.1<br />
ejspence@nia0003 $<br />
<br />
The next step is to create a startup script which will be run by the server, in case you are running on multiple nodes:<br />
<br />
#!/bin/bash<br />
module purge<br />
module load NiaEnv/2019b<br />
module load gcc/8.3.0 openmpi/4.0.1 ddt/19.1<br />
export ARM_TOOLS_CONFIG_DIR=${SCRATCH}/.arm<br />
mkdir -p ${ARM_TOOLS_CONFIG_DIR}<br />
export OMPI_MCA_pml=ob1<br />
<br />
Be sure to load whatever modules your code needs to run. Let us assume that the path to this script is $SCRATCH/ddt_remote_setup.sh.<br />
<br />
This completes the setup of the server side. There is no need to launch the server; the client itself will do this.<br />
<br />
===Setting up the client side===<br />
<br />
You now need to set up the client on your local machine (desktop or laptop). The first step is to go to [https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge/older-versions-of-remote-client-for-arm-forge this page] to download the Arm Forge client. Note that this page is for older versions of DDT. This is because the client and the server must be running the same version of DDT, and the version on Niagara is 19.1. Download the version of the client appropriate for your local machine, and install it.<br />
<br />
Now launch Arm Forge. You will see a screen similar to the one below. Select "Remote Launch", then "Configure".<br />
<br />
{| align="center"<br />
| [[File:DDT openning.png|480px|]]<br />
|}<br />
<br />
You will see that there are no sessions already configured. Click on "Add" to create a new session configuration.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions.png|480px|]]<br />
|}<br />
<br />
Next, fill in the details of the session. You need to fill in <br />
* the name of the session,<br />
* the host name, consisting of<br />
** your login credentials for Niagara (or Mist),<br />
** a space,<br />
** your user name and the node you are using (nia0003 in this example),<br />
* the installation directory of DDT on Niagara,<br />
* the location of your startup script.<br />
<br />
{| align="center"<br />
| [[File:DDT settings.png|540px|]]<br />
|}<br />
<br />
After you've entered the settings, click on "OK". This should bring you to the screen seen below.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions2.png|480px|]]<br />
|}<br />
<br />
The opening screen should now look like the one below.<br />
<br />
{| align="center"<br />
| [[File:DDT openning2.png|480px|]]<br />
|}<br />
<br />
Click on the session you'd like to launch. In this example, "DDT Test". This will bring you to DDT's launch screen, which is the same you'll see when you run DDT normally. Note that the code and files you will be testing must be hosted on Niagara, not on your local machine.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=R&diff=2907
R
2021-01-20T03:04:50Z
<p>Ejspence: /* Submitting an R cluster job */</p>
<hr />
<div>[http://www.r-project.org/ R] is a programming language that continues to grow in popularity for data analysis. It is very fast to write code in, but the software that results is much, much slower than C or Fortran; one should be wary of doing too much compute-intensive work in R.<br />
<br />
==Running R on Niagara==<br />
<br />
We currently have two families of R installed on Niagara. <br />
* Anaconda R<br />
* regular R<br />
<br />
Here we describe the differences between these packages.<br />
<br />
=== Anaconda R===<br />
<br />
Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is [https://anaconda.org/r here]. <br />
<br />
As of 30 July 2018 the following Anaconda modules are available:<br />
<br />
$ module avail anaconda<br />
----------------- /scinet/niagara/software/2018a/modules/base ------------------<br />
anaconda2/5.1.0 python/2.7.14-anaconda5.1.0 r/3.4.3-anaconda5.1.0<br />
anaconda3/5.1.0 python/3.6.4-anaconda5.1.0<br />
<br />
Note that there is a single Anaconda R module available, and that none of these modules require a compiler to be loaded. The Anaconda R module is R version 3.4.3, which comes from the Anaconda version 5.1.0.<br />
<br />
You load the module in the usual way:<br />
<br />
$ module load r/3.4.3-anaconda5.1.0<br />
$ R<br />
><br />
<br />
=== Regular R ===<br />
<br />
The base R program has also been installed from source. This installation comes with no R packages installed other than the base installation.<br />
<br />
$ module spider r<br />
--------------------------------------------------------------------------------<br />
r:<br />
--------------------------------------------------------------------------------<br />
Versions:<br />
r/3.4.3-anaconda5.1.0<br />
r/3.5.0<br />
$<br />
$ module spider r/3.5.0<br />
--------------------------------------------------------------------------------<br />
r: r/3.5.0<br />
--------------------------------------------------------------------------------<br />
You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.<br />
intel/2018.2<br />
$<br />
$ module load intel/2018.2 r/3.5.0<br />
$ R<br />
><br />
<br />
<!--<br />
(The intel module is a prerequesite for the R module). If you will be using Rmpi, you will need to load the openmpi module as well.--><br />
<br />
Many optional packages are available for R which add functionality for specific domains; they are available through the [http://cran.r-project.org/mirrors.html Comprehensive R Archive Network (CRAN)]. <br />
<br />
R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide; because there are so many optional packages that R users could potentially want, we recommend that users who want additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict. <br />
<br />
In general, you can install those that you need yourself in your home directory; eg, <br />
<br />
<pre><br />
$ R <br />
> install.packages("package-name", dependencies = TRUE)<br />
</pre><br />
<br />
will download and compile the source for the packages you need in your home directory under <tt>${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/</tt> (you can specify another directory with a lib= option.) Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a login node, as write access to the library folder is not available from the compute nodes of the cluster. <br />
<!--<br />
Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.--><br />
<br />
=== Running serial R jobs ===<br />
<br />
As with all serial jobs, if your R computation does not use multiple cores, you should bundle them up so the 40 cores of a node are all performing work. Examples of this can be found on [[Running_Serial_Jobs_on_Niagara | this page]].<br />
<!--<br />
== Saving images from R in compute jobs ==<br />
<br />
To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like<br />
<br />
unable to open connection to X11 display ''<br />
<br />
To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:<br />
<br />
# Make virtual X server command called Xvfb available:<br />
module load Xlibraries<br />
<br />
# Select a unique display number:<br />
let DISPLAYNUM=$UID%65274<br />
export DISPLAY=":$DISPLAYNUM"<br />
<br />
# Start the virtual X server<br />
Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &<br />
<br />
After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed which your job is done. Alternatively, you may want to kill it explicitly at the end of you job script using <br />
<br />
# Kill any remaining Xvfb server<br />
pkill -u $UID Xvfb<br />
--><br />
<br />
== Using a Jupyter Notebook ==<br />
<br />
You may develop your R scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See [[Jupyter_Hub | this page]] for details on how to access that node, and develop your code.<br />
<br />
== Rmpi (R with MPI) ==<br />
<br />
None of the R installations on Niagara have Rmpi installed by default. <br />
<br />
=== Installing Rmpi, version 3.5.0 ===<br />
<br />
Version 3.5.0 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI. <br />
<br />
Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what MPI version you are using. <br />
<br />
The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command-line (as well as in your jobs scripts).<br />
<br />
Because the MPI modules define all the paths in environment variables, the following lines seem to work for all OpenMPI versions.<br />
<br />
<pre><br />
$ module load intel/2018.2<br />
$ module load openmpi/3.1.0<br />
$ module load r/3.5.0<br />
$<br />
$ R<br />
><br />
> install.packages("Rmpi",<br />
configure.args =<br />
c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),<br />
paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "lib", sep=""),<br />
"--with-Rmpi-type=OPENMPI"))<br />
</pre><br />
<br />
For intelmpi, you only need to change <tt>OPENMPI</tt> to <tt>MPICH2</tt> in the last line.<br />
<br />
=== Running Rmpi ===<br />
<br />
To start using R with Rmpi, make sure you have all required modules loaded (e.g. <tt>module load intel/2018.2 openmpi/3.1.0 r/3.5.0</tt>), then launch it with<br />
<pre><br />
$ mpirun -np 1 R --no-save<br />
</pre><br />
which starts one master MPI process, but sets up the infrastructure needed to spawn additional processes.<br />
<br />
== Creating an R cluster ==<br />
<br />
The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.<br />
<br />
=== Creating your Rscript wrapper ===<br />
<br />
The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:<br />
<pre><br />
#!/bin/bash<br />
<br />
module load intel/2019u4 r/3.6.1<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"<br />
</pre><br />
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.<br />
<br />
Once you've created your wrapper, make it executable:<br />
<pre><br />
$ chmod u+x MyRscript.sh<br />
</pre><br />
Your wrapper is now ready to be used.<br />
<br />
=== The cluster R code ===<br />
The R code which we will run consists of two parts: the code which launches the cluster and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.<br />
<pre><br />
<br />
######################################################<br />
#<br />
# worker code<br />
#<br />
<br />
# first define the function which will be run on all the cluster nodes. This is just a test function. <br />
# Put your real worker code here.<br />
testfunc <- function(a) {<br />
<br />
# this part is just to waste time<br />
b <- 0<br />
for (i in 1:100000000) {<br />
b <- b + 1<br />
}<br />
<br />
s <- Sys.info()['nodename']<br />
return(paste0(s, " ", a[1], " ", a[2]))<br />
<br />
}<br />
<br />
<br />
######################################################<br />
#<br />
# head node code<br />
#<br />
<br />
# Create a bunch of index pairs to feed to the worker function. These could be parameters,<br />
# or whatever your code needs to vary across jobs. Note that the worker function only <br />
# takes a single argument; each entry in the list must contain all the information <br />
# that the function needs to run. In this example, each entry contains a list which<br />
# contains two pieces of information, a pair of indices.<br />
indexlist <- list()<br />
index <- 1<br />
for (i in 1:10) {<br />
for (j in 1:10) {<br />
indexlist[index] <- list(c(i,j))<br />
index <- index +1<br />
}<br />
}<br />
<br />
<br />
# Now set up the cluster.<br />
<br />
# First load the parallel library.<br />
library(parallel)<br />
<br />
# Next find all the nodes which the scheduler has given to us.<br />
# These are given by the SLURM_JOB_NODELIST environment variable.<br />
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")<br />
<br />
# Get your SCRATCH directory.<br />
my.scratch <- Sys.getenv("SCRATCH")<br />
<br />
<br />
node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]<br />
<br />
if (length(node_ids)>0) {<br />
expanded_ids <- lapply(node_ids, function (id) {<br />
ranges <- as.numeric(<br />
unlist(strsplit(id, split="[-]"))<br />
)<br />
if (length(ranges)>1) seq(ranges[1], ranges[2], by=1) else ranges<br />
})<br />
<br />
nodelist <- sprintf("nia%04d", unlist(expanded_ids))<br />
}<br />
<br />
## We now have the nodelist, but we need to repeat it for each core in the nodes.<br />
nodelist <- rep(nodelist, 40)<br />
<br />
# Now launch the cluster, using the list of nodes and our Rscript<br />
# wrapper. This assumes that your MyRscript.sh is at the base of your SCRATCH directory.<br />
cl <- makePSOCKcluster(names = nodelist, rscript = paste(my.scratch, "/MyRscript.sh", sep = ""))<br />
<br />
# Now run the worker code, using the parameter list we created above.<br />
result <- clusterApplyLB(cl, indexlist, testfunc)<br />
<br />
# The results of all the jobs will now be put in the 'result' variable,<br />
# in the order they were specified in the 'indexlist' variable.<br />
<br />
# Don't forget to stop the cluster when you're finished.<br />
stopCluster(cl)<br />
</pre><br />
You can, of course, add any post-processing code you need to the above code.<br />
<br />
=== Submitting an R cluster job ===<br />
You are now ready to submit your job to the Niagara queue. The submission script is like most others:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --time=5:00:00<br />
#SBATCH --job-name MyRCluster<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2019u4 r/3.6.1<br />
<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R<br />
</pre><br />
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.<br />
<br />
== SciNet's R Classes ==<br />
<br />
There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=msc1090&include=all&filter=Filter MSC1090]: Introduction to Computational BioStatistics with R. This graduate-level [https://ims.utoronto.ca IMS]-sponsored class is open to graduate students in IMS or other fields. It is intended for those with little-to-no programming experience who wish to use R in scientific research.<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=ees1137&include=all&filter=Filter EES1137]: Quantitative Applications for Data Analysis. [https://www.utsc.utoronto.ca/gradpes/ees1137h-quantitative-applications-data-analysis This class] is similar to MSC1090, but takes place at UTSC, and is sponsored by the Department of Physical and Environmental Sciences.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=R&diff=2906
R
2021-01-20T03:03:44Z
<p>Ejspence: /* The cluster R code */</p>
<hr />
<div>[http://www.r-project.org/ R] is a programming language that continues to grow in popularity for data analysis. It is very fast to write code in, but the software that results is much, much slower than C or Fortran; one should be wary of doing too much compute-intensive work in R.<br />
<br />
==Running R on Niagara==<br />
<br />
We currently have two families of R installed on Niagara. <br />
* Anaconda R<br />
* regular R<br />
<br />
Here we describe the differences between these packages.<br />
<br />
=== Anaconda R===<br />
<br />
Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is [https://anaconda.org/r here]. <br />
<br />
As of 30 July 2018 the following Anaconda modules are available:<br />
<br />
$ module avail anaconda<br />
----------------- /scinet/niagara/software/2018a/modules/base ------------------<br />
anaconda2/5.1.0 python/2.7.14-anaconda5.1.0 r/3.4.3-anaconda5.1.0<br />
anaconda3/5.1.0 python/3.6.4-anaconda5.1.0<br />
<br />
Note that there is a single Anaconda R module available, and that none of these modules require a compiler to be loaded. The Anaconda R module is R version 3.4.3, which comes from the Anaconda version 5.1.0.<br />
<br />
You load the module in the usual way:<br />
<br />
$ module load r/3.4.3-anaconda5.1.0<br />
$ R<br />
><br />
<br />
=== Regular R ===<br />
<br />
The base R program has also been installed from source. This installation comes with no R packages installed other than the base installation.<br />
<br />
$ module spider r<br />
--------------------------------------------------------------------------------<br />
r:<br />
--------------------------------------------------------------------------------<br />
Versions:<br />
r/3.4.3-anaconda5.1.0<br />
r/3.5.0<br />
$<br />
$ module spider r/3.5.0<br />
--------------------------------------------------------------------------------<br />
r: r/3.5.0<br />
--------------------------------------------------------------------------------<br />
You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.<br />
intel/2018.2<br />
$<br />
$ module load intel/2018.2 r/3.5.0<br />
$ R<br />
><br />
<br />
<!--<br />
(The intel module is a prerequesite for the R module). If you will be using Rmpi, you will need to load the openmpi module as well.--><br />
<br />
Many optional packages are available for R which add functionality for specific domains; they are available through the [http://cran.r-project.org/mirrors.html Comprehensive R Archive Network (CRAN)]. <br />
<br />
R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide; because there are so many optional packages that R users could potentially want, we recommend that users who want additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict. <br />
<br />
In general, you can install those that you need yourself in your home directory; eg, <br />
<br />
<pre><br />
$ R <br />
> install.packages("package-name", dependencies = TRUE)<br />
</pre><br />
<br />
will download and compile the source for the packages you need in your home directory under <tt>${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/</tt> (you can specify another directory with a lib= option.) Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a login node, as write access to the library folder is not available from the compute nodes of the cluster. <br />
<!--<br />
Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.--><br />
<br />
=== Running serial R jobs ===<br />
<br />
As with all serial jobs, if your R computation does not use multiple cores, you should bundle them up so the 40 cores of a node are all performing work. Examples of this can be found on [[Running_Serial_Jobs_on_Niagara | this page]].<br />
<!--<br />
== Saving images from R in compute jobs ==<br />
<br />
To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like<br />
<br />
unable to open connection to X11 display ''<br />
<br />
To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:<br />
<br />
# Make virtual X server command called Xvfb available:<br />
module load Xlibraries<br />
<br />
# Select a unique display number:<br />
let DISPLAYNUM=$UID%65274<br />
export DISPLAY=":$DISPLAYNUM"<br />
<br />
# Start the virtual X server<br />
Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &<br />
<br />
After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed which your job is done. Alternatively, you may want to kill it explicitly at the end of you job script using <br />
<br />
# Kill any remaining Xvfb server<br />
pkill -u $UID Xvfb<br />
--><br />
<br />
== Using a Jupyter Notebook ==<br />
<br />
You may develop your R scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See [[Jupyter_Hub | this page]] for details on how to access that node, and develop your code.<br />
<br />
== Rmpi (R with MPI) ==<br />
<br />
None of the R installations on Niagara have Rmpi installed by default. <br />
<br />
=== Installing Rmpi, version 3.5.0 ===<br />
<br />
Version 3.5.0 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI. <br />
<br />
Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what MPI version you are using. <br />
<br />
The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command-line (as well as in your jobs scripts).<br />
<br />
Because the MPI modules define all the paths in environment variables, the following lines seem to work for all OpenMPI versions.<br />
<br />
<pre><br />
$ module load intel/2018.2<br />
$ module load openmpi/3.1.0<br />
$ module load r/3.5.0<br />
$<br />
$ R<br />
><br />
> install.packages("Rmpi",<br />
configure.args =<br />
c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),<br />
paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "lib", sep=""),<br />
"--with-Rmpi-type=OPENMPI"))<br />
</pre><br />
<br />
For intelmpi, you only need to change <tt>OPENMPI</tt> to <tt>MPICH2</tt> in the last line.<br />
<br />
=== Running Rmpi ===<br />
<br />
To start using R with Rmpi, make sure you have all required modules loaded (e.g. <tt>module load intel/2018.2 openmpi/3.1.0 r/3.5.0</tt>), then launch it with<br />
<pre><br />
$ mpirun -np 1 R --no-save<br />
</pre><br />
which starts one master MPI process, but sets up the infrastructure needed to spawn additional processes.<br />
<br />
== Creating an R cluster ==<br />
<br />
The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.<br />
<br />
=== Creating your Rscript wrapper ===<br />
<br />
The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:<br />
<pre><br />
#!/bin/bash<br />
<br />
module load intel/2019u4 r/3.6.1<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"<br />
</pre><br />
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.<br />
<br />
Once you've created your wrapper, make it executable:<br />
<pre><br />
$ chmod u+x MyRscript.sh<br />
</pre><br />
Your wrapper is now ready to be used.<br />
<br />
=== The cluster R code ===<br />
The R code which we will run consists of two parts: the code which launches the cluster and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.<br />
<pre><br />
<br />
######################################################<br />
#<br />
# worker code<br />
#<br />
<br />
# first define the function which will be run on all the cluster nodes. This is just a test function. <br />
# Put your real worker code here.<br />
testfunc <- function(a) {<br />
<br />
# this part is just to waste time<br />
b <- 0<br />
for (i in 1:100000000) {<br />
b <- b + 1<br />
}<br />
<br />
s <- Sys.info()['nodename']<br />
return(paste0(s, " ", a[1], " ", a[2]))<br />
<br />
}<br />
<br />
<br />
######################################################<br />
#<br />
# head node code<br />
#<br />
<br />
# Create a bunch of index pairs to feed to the worker function. These could be parameters,<br />
# or whatever your code needs to vary across jobs. Note that the worker function only <br />
# takes a single argument; each entry in the list must contain all the information <br />
# that the function needs to run. In this example, each entry contains a list which<br />
# contains two pieces of information, a pair of indices.<br />
indexlist <- list()<br />
index <- 1<br />
for (i in 1:10) {<br />
for (j in 1:10) {<br />
indexlist[index] <- list(c(i,j))<br />
index <- index +1<br />
}<br />
}<br />
<br />
<br />
# Now set up the cluster.<br />
<br />
# First load the parallel library.<br />
library(parallel)<br />
<br />
# Next find all the nodes which the scheduler has given to us.<br />
# These are given by the SLURM_JOB_NODELIST environment variable.<br />
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")<br />
<br />
# Get your SCRATCH directory.<br />
my.scratch <- Sys.getenv("SCRATCH")<br />
<br />
<br />
node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]<br />
<br />
if (length(node_ids)>0) {<br />
expanded_ids <- lapply(node_ids, function (id) {<br />
ranges <- as.numeric(<br />
unlist(strsplit(id, split="[-]"))<br />
)<br />
if (length(ranges)>1) seq(ranges[1], ranges[2], by=1) else ranges<br />
})<br />
<br />
nodelist <- sprintf("nia%04d", unlist(expanded_ids))<br />
}<br />
<br />
## We now have the nodelist, but we need to repeat it for each core in the nodes.<br />
nodelist <- rep(nodelist, 40)<br />
<br />
# Now launch the cluster, using the list of nodes and our Rscript<br />
# wrapper. This assumes that your MyRscript.sh is at the base of your SCRATCH directory.<br />
cl <- makePSOCKcluster(names = nodelist, rscript = paste(my.scratch, "/MyRscript.sh", sep = ""))<br />
<br />
# Now run the worker code, using the parameter list we created above.<br />
result <- clusterApplyLB(cl, indexlist, testfunc)<br />
<br />
# The results of all the jobs will now be put in the 'result' variable,<br />
# in the order they were specified in the 'indexlist' variable.<br />
<br />
# Don't forget to stop the cluster when you're finished.<br />
stopCluster(cl)<br />
</pre><br />
You can, of course, add any post-processing code you need to the above code.<br />
<br />
=== Submitting an R cluster job ===<br />
You are now ready to submit your job to the Niagara queue. The submission script is like most others:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --time=5:00:00<br />
#SBATCH --job-name MyRCluster<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2 r/3.5.0<br />
<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R<br />
</pre><br />
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.<br />
<br />
== SciNet's R Classes ==<br />
<br />
There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=msc1090&include=all&filter=Filter MSC1090]: Introduction to Computational BioStatistics with R. This graduate-level [https://ims.utoronto.ca IMS]-sponsored class is open to graduate students in IMS or other fields. It is intended for those with little-to-no programming experience who wish to use R in scientific research.<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=ees1137&include=all&filter=Filter EES1137]: Quantitative Applications for Data Analysis. [https://www.utsc.utoronto.ca/gradpes/ees1137h-quantitative-applications-data-analysis This class] is similar to MSC1090, but takes place at UTSC, and is sponsored by the Department of Physical and Environmental Sciences.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=R&diff=2905
R
2021-01-20T03:00:51Z
<p>Ejspence: /* The cluster R code */</p>
<hr />
<div>[http://www.r-project.org/ R] is a programming language that continues to grow in popularity for data analysis. It is very fast to write code in, but the software that results is much, much slower than C or Fortran; one should be wary of doing too much compute-intensive work in R.<br />
<br />
==Running R on Niagara==<br />
<br />
We currently have two families of R installed on Niagara. <br />
* Anaconda R<br />
* regular R<br />
<br />
Here we describe the differences between these packages.<br />
<br />
=== Anaconda R===<br />
<br />
Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is [https://anaconda.org/r here]. <br />
<br />
As of 30 July 2018 the following Anaconda modules are available:<br />
<br />
$ module avail anaconda<br />
----------------- /scinet/niagara/software/2018a/modules/base ------------------<br />
anaconda2/5.1.0 python/2.7.14-anaconda5.1.0 r/3.4.3-anaconda5.1.0<br />
anaconda3/5.1.0 python/3.6.4-anaconda5.1.0<br />
<br />
Note that there is a single Anaconda R module available, and that none of these modules require a compiler to be loaded. The Anaconda R module is R version 3.4.3, which comes from the Anaconda version 5.1.0.<br />
<br />
You load the module in the usual way:<br />
<br />
$ module load r/3.4.3-anaconda5.1.0<br />
$ R<br />
><br />
<br />
=== Regular R ===<br />
<br />
The base R program has also been installed from source. This installation comes with no R packages installed other than the base installation.<br />
<br />
$ module spider r<br />
--------------------------------------------------------------------------------<br />
r:<br />
--------------------------------------------------------------------------------<br />
Versions:<br />
r/3.4.3-anaconda5.1.0<br />
r/3.5.0<br />
$<br />
$ module spider r/3.5.0<br />
--------------------------------------------------------------------------------<br />
r: r/3.5.0<br />
--------------------------------------------------------------------------------<br />
You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.<br />
intel/2018.2<br />
$<br />
$ module load intel/2018.2 r/3.5.0<br />
$ R<br />
><br />
<br />
<!--<br />
(The intel module is a prerequesite for the R module). If you will be using Rmpi, you will need to load the openmpi module as well.--><br />
<br />
Many optional packages are available for R which add functionality for specific domains; they are available through the [http://cran.r-project.org/mirrors.html Comprehensive R Archive Network (CRAN)]. <br />
<br />
R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide; because there are so many optional packages that R users could potentially want, we recommend that users who need additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict.<br />
<br />
In general, you can install the packages that you need yourself in your home directory; e.g.,<br />
<br />
<pre><br />
$ R <br />
> install.packages("package-name", dependencies = TRUE)<br />
</pre><br />
<br />
will download and compile the source for the packages you need in your home directory, under <tt>${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/</tt> or similar, depending on your R version (you can specify another directory with the lib= option). Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a login (development) node, since write access to the library folder is not available from the compute nodes of the cluster.<br />
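For example, here is a minimal sketch of installing a package into a non-default library directory and telling R where to find it (the package name and directory are just placeholders):<br />
<pre><br />
$ mkdir -p ~/R/mylibs<br />
$ R<br />
> # Choose a library directory and install into it with the lib= option.<br />
> mylib <- file.path(Sys.getenv("HOME"), "R", "mylibs")<br />
> install.packages("data.table", dependencies = TRUE, lib = mylib)<br />
> # Prepend the directory to R's library search path (see help(".libPaths")).<br />
> .libPaths(c(mylib, .libPaths()))<br />
> library(data.table)<br />
</pre><br />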
<!--<br />
Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.--><br />
<br />
=== Running serial R jobs ===<br />
<br />
As with all serial jobs, if your R computation does not use multiple cores, you should bundle them up so the 40 cores of a node are all performing work. Examples of this can be found on [[Running_Serial_Jobs_on_Niagara | this page]].<br />
<!--<br />
== Saving images from R in compute jobs ==<br />
<br />
To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like<br />
<br />
unable to open connection to X11 display ''<br />
<br />
To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:<br />
<br />
# Make virtual X server command called Xvfb available:<br />
module load Xlibraries<br />
<br />
# Select a unique display number:<br />
let DISPLAYNUM=$UID%65274<br />
export DISPLAY=":$DISPLAYNUM"<br />
<br />
# Start the virtual X server<br />
Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &<br />
<br />
After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed which your job is done. Alternatively, you may want to kill it explicitly at the end of you job script using <br />
<br />
# Kill any remaining Xvfb server<br />
pkill -u $UID Xvfb<br />
--><br />
<br />
== Using a Jupyter Notebook ==<br />
<br />
You may develop your R scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See [[Jupyter_Hub | this page]] for details on how to access that node, and develop your code.<br />
<br />
== Rmpi (R with MPI) ==<br />
<br />
None of the R installations on Niagara have Rmpi installed by default. <br />
<br />
=== Installing Rmpi, version 3.5.0 ===<br />
<br />
Version 3.5.0 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI. <br />
<br />
Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what MPI version you are using. <br />
<br />
The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command-line (as well as in your jobs scripts).<br />
<br />
Because the MPI modules define all the paths in environment variables, the following lines seem to work for installations with all OpenMPI versions.<br />
<br />
<pre><br />
$ module load intel/2018.2<br />
$ module load openmpi/3.1.0<br />
$ module load r/3.5.0<br />
$<br />
$ R<br />
><br />
> install.packages("Rmpi",<br />
configure.args =<br />
c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),<br />
paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "lib", sep=""),<br />
"--with-Rmpi-type=OPENMPI"))<br />
</pre><br />
<br />
For intelmpi, you only need to change <tt>OPENMPI</tt> to <tt>MPICH2</tt> in the last line.<br />
<br />
=== Running Rmpi ===<br />
<br />
To start using R with Rmpi, make sure you have all required modules loaded (e.g. <tt>module load intel/2018.2 openmpi/3.1.0 r/3.5.0</tt>), then launch it with<br />
<pre><br />
$ mpirun -np 1 R --no-save<br />
</pre><br />
which starts one master MPI process, but sets up the infrastructure needed to spawn additional worker processes.<br />
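As a minimal, hypothetical sketch (not an official SciNet example), the following script spawns a few workers and has each report its rank; save it as, say, hello_rmpi.R and launch it as above, e.g. with <tt>mpirun -np 1 Rscript hello_rmpi.R</tt>:<br />
<pre><br />
# hello_rmpi.R: a minimal Rmpi test<br />
library(Rmpi)<br />
<br />
# Spawn three worker (slave) processes from the single master started by mpirun.<br />
mpi.spawn.Rslaves(nslaves = 3)<br />
<br />
# Have every worker report its rank and the size of the communicator.<br />
print(mpi.remote.exec(paste("Hello from rank", mpi.comm.rank(), "of", mpi.comm.size())))<br />
<br />
# Shut the workers down cleanly and exit.<br />
mpi.close.Rslaves()<br />
mpi.quit()<br />
</pre><br />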
<br />
== Creating an R cluster ==<br />
<br />
The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.<br />
<br />
=== Creating your Rscript wrapper ===<br />
<br />
The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:<br />
<pre><br />
#!/bin/bash<br />
<br />
module load intel/2019u4 r/3.6.1<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"<br />
</pre><br />
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.<br />
<br />
Once you've created your wrapper, make it executable:<br />
<pre><br />
$ chmod u+x MyRscript.sh<br />
</pre><br />
Your wrapper is now ready to be used.<br />
<br />
=== The cluster R code ===<br />
The R code which we will run consists of two parts: the head-node code, which launches the cluster and does the pre- and post-analysis, and the worker code, which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.<br />
<pre><br />
<br />
######################################################<br />
#<br />
# worker code<br />
#<br />
<br />
# first define the function which will be run on all the cluster nodes. This is just a test function. <br />
# Put your real worker code here.<br />
testfunc <- function(a) {<br />
<br />
  # this part is just to waste time<br />
  b <- 0<br />
  for (i in 1:100000000) {<br />
    b <- b + 1<br />
  }<br />
<br />
  s <- Sys.info()['nodename']<br />
  return(paste0(s, " ", a[1], " ", a[2]))<br />
<br />
}<br />
<br />
<br />
######################################################<br />
#<br />
# head node code<br />
#<br />
<br />
# Create a bunch of index pairs to feed to the worker function. These could be parameters,<br />
# or whatever your code needs to vary across jobs. Note that the worker function only <br />
# takes a single argument; each entry in the list must contain all the information <br />
# that the function needs to run. In this example, each entry contains a list which<br />
# contains two pieces of information, a pair of indices.<br />
indexlist <- list()<br />
index <- 1<br />
for (i in 1:10) {<br />
  for (j in 1:10) {<br />
    indexlist[index] <- list(c(i,j))<br />
    index <- index + 1<br />
  }<br />
}<br />
<br />
<br />
# Now set up the cluster.<br />
<br />
# First load the parallel library.<br />
library(parallel)<br />
<br />
# Next find all the nodes which the scheduler has given to us.<br />
# These are given by the SLURM_JOB_NODELIST environment variable.<br />
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")<br />
<br />
node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]<br />
<br />
if (length(node_ids) > 0) {<br />
  expanded_ids <- lapply(node_ids, function (id) {<br />
    ranges <- as.numeric(<br />
      unlist(strsplit(id, split="[-]"))<br />
    )<br />
    if (length(ranges) > 1) seq(ranges[1], ranges[2], by=1) else ranges<br />
  })<br />
<br />
  nodelist <- sprintf("nia%04d", unlist(expanded_ids))<br />
}<br />
<br />
# Now launch the cluster, using the list of nodes and our Rscript<br />
# wrapper.<br />
cl <- makePSOCKcluster(names = nodelist, rscript = "/path/to/your/MyRscript.sh")<br />
<br />
# Now run the worker code, using the parameter list we created above.<br />
result <- clusterApplyLB(cl, indexlist, testfunc)<br />
<br />
# The results of all the jobs will now be put in the 'result' variable,<br />
# in the order they were specified in the 'indexlist' variable.<br />
<br />
# Don't forget to stop the cluster when you're finished.<br />
stopCluster(cl)<br />
</pre><br />
You can, of course, add any post-processing code you need to the above code.<br />
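For instance, a trivial (purely illustrative) post-processing step might simply write the collected results to a text file:<br />
<pre><br />
# Save each worker's result string to a file for later inspection.<br />
writeLines(unlist(result), "cluster_results.txt")<br />
</pre><br />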
<br />
=== Submitting an R cluster job ===<br />
You are now ready to submit your job to the Niagara queue. The submission script is like most others:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --time=5:00:00<br />
#SBATCH --job-name MyRCluster<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2 r/3.5.0<br />
<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R<br />
</pre><br />
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.<br />
<br />
== SciNet's R Classes ==<br />
<br />
There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=msc1090&include=all&filter=Filter MSC1090]: Introduction to Computational BioStatistics with R. This graduate-level [https://ims.utoronto.ca IMS]-sponsored class is open to graduate students in IMS and other fields, and is intended for those with little-to-no programming experience who wish to use R in scientific research.<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=ees1137&include=all&filter=Filter EES1137]: Quantitative Applications for Data Analysis. [https://www.utsc.utoronto.ca/gradpes/ees1137h-quantitative-applications-data-analysis This class] is similar to MSC1090, but takes place at UTSC and is sponsored by the Department of Physical and Environmental Sciences.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=R&diff=2904
R
2021-01-20T03:00:15Z
<p>Ejspence: /* Creating your Rscript wrapper */</p>
<hr />
<div>[http://www.r-project.org/ R] is a programming language that continues to grow in popularity for data analysis. It is very fast to write code in, but the resulting software is much slower than equivalent C or Fortran; one should be wary of doing too much compute-intensive work in R.<br />
<br />
==Running R on Niagara==<br />
<br />
We currently have two families of R installed on Niagara. <br />
* Anaconda R<br />
* regular R<br />
<br />
Here we describe the differences between these packages.<br />
<br />
=== Anaconda R===<br />
<br />
Anaconda is a pre-assembled set of commonly-used data science tools, which recently added R to its suite of packages. The source for this collection is [https://anaconda.org/r here]. <br />
<br />
As of 30 July 2018 the following Anaconda modules are available:<br />
<br />
$ module avail anaconda<br />
----------------- /scinet/niagara/software/2018a/modules/base ------------------<br />
anaconda2/5.1.0 python/2.7.14-anaconda5.1.0 r/3.4.3-anaconda5.1.0<br />
anaconda3/5.1.0 python/3.6.4-anaconda5.1.0<br />
<br />
Note that there is a single Anaconda R module available, and that none of these modules require a compiler to be loaded. The Anaconda R module is R version 3.4.3, which comes from the Anaconda version 5.1.0.<br />
<br />
You load the module in the usual way:<br />
<br />
$ module load r/3.4.3-anaconda5.1.0<br />
$ R<br />
><br />
<br />
=== Regular R ===<br />
<br />
The base R program has also been installed from source. This installation comes with no R packages other than those in the base distribution.<br />
<br />
$ module spider r<br />
--------------------------------------------------------------------------------<br />
r:<br />
--------------------------------------------------------------------------------<br />
Versions:<br />
r/3.4.3-anaconda5.1.0<br />
r/3.5.0<br />
$<br />
$ module spider r/3.5.0<br />
--------------------------------------------------------------------------------<br />
r: r/3.5.0<br />
--------------------------------------------------------------------------------<br />
You will need to load all module(s) on any one of the lines below before the "r/3.5.0" module is available to load.<br />
intel/2018.2<br />
$<br />
$ module load intel/2018.2 r/3.5.0<br />
$ R<br />
><br />
<br />
<!--<br />
(The intel module is a prerequesite for the R module). If you will be using Rmpi, you will need to load the openmpi module as well.--><br />
<br />
Many optional packages are available for R which add functionality for specific domains; they are available through the [http://cran.r-project.org/mirrors.html Comprehensive R Archive Network (CRAN)]. <br />
<br />
R provides an easy way for users to install the libraries they need in their home directories rather than having them installed system-wide; because there are so many optional packages that R users could potentially want, we recommend that users who need additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, ensure they're up to date, and ensure that users' package choices don't conflict.<br />
<br />
In general, you can install the packages that you need yourself in your home directory; e.g.,<br />
<br />
<pre><br />
$ R <br />
> install.packages("package-name", dependencies = TRUE)<br />
</pre><br />
<br />
will download and compile the source for the packages you need in your home directory, under <tt>${HOME}/R/x86_64-unknown-linux-gnu-library/2.11/</tt> or similar, depending on your R version (you can specify another directory with the lib= option). Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled. Note that you must install packages while logged into a login (development) node, since write access to the library folder is not available from the compute nodes of the cluster.<br />
<!--<br />
Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.--><br />
<br />
=== Running serial R jobs ===<br />
<br />
As with all serial jobs, if your R computation does not use multiple cores, you should bundle them up so the 40 cores of a node are all performing work. Examples of this can be found on [[Running_Serial_Jobs_on_Niagara | this page]].<br />
<!--<br />
== Saving images from R in compute jobs ==<br />
<br />
To make use of the graphics capability of R, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, and you'd get a message like<br />
<br />
unable to open connection to X11 display ''<br />
<br />
To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:<br />
<br />
# Make virtual X server command called Xvfb available:<br />
module load Xlibraries<br />
<br />
# Select a unique display number:<br />
let DISPLAYNUM=$UID%65274<br />
export DISPLAY=":$DISPLAYNUM"<br />
<br />
# Start the virtual X server<br />
Xvfb $DISPLAY -fp $SCINET_FONTPATH -ac 2>/dev/null &<br />
<br />
After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed which your job is done. Alternatively, you may want to kill it explicitly at the end of you job script using <br />
<br />
# Kill any remaining Xvfb server<br />
pkill -u $UID Xvfb<br />
--><br />
<br />
== Using a Jupyter Notebook ==<br />
<br />
You may develop your R scripts in a Jupyter Notebook on Niagara. A node has been set aside as a Jupyter Hub. See [[Jupyter_Hub | this page]] for details on how to access that node, and develop your code.<br />
<br />
== Rmpi (R with MPI) ==<br />
<br />
None of the R installations on Niagara have Rmpi installed by default. <br />
<br />
=== Installing Rmpi, version 3.5.0 ===<br />
<br />
Version 3.5.0 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI. <br />
<br />
Installing the Rmpi package can be a bit challenging, since some additional parameters need to be given to the installation, which contain the path to various header files and libraries. These paths differ depending on what MPI version you are using. <br />
<br />
The various MPI versions on Niagara are loaded with the module command. So the first thing to do is to decide what MPI version to use (OpenMPI or IntelMPI), and to type the corresponding "module load" command on the command-line (as well as in your jobs scripts).<br />
<br />
Because the MPI modules define all the paths in environment variables, the following lines seem to work for installations with all OpenMPI versions.<br />
<br />
<pre><br />
$ module load intel/2018.2<br />
$ module load openmpi/3.1.0<br />
$ module load r/3.5.0<br />
$<br />
$ R<br />
><br />
> install.packages("Rmpi",<br />
configure.args =<br />
c(paste("--with-Rmpi-include=", Sys.getenv("SCINET_OPENMPI_ROOT"), "/include", sep=""),<br />
paste("--with-Rmpi-libpath=", Sys.getenv("SCINET_OPENMPI_ROOT"), "lib", sep=""),<br />
"--with-Rmpi-type=OPENMPI"))<br />
</pre><br />
<br />
For intelmpi, you only need to change <tt>OPENMPI</tt> to <tt>MPICH2</tt> in the last line.<br />
<br />
=== Running Rmpi ===<br />
<br />
To start using R with Rmpi, make sure you have all required modules loaded (e.g. <tt>module load intel/2018.2 openmpi/3.1.0 r/3.5.0</tt>), then launch it with<br />
<pre><br />
$ mpirun -np 1 R --no-save<br />
</pre><br />
which starts one master MPI process, but sets up the infrastructure needed to spawn additional worker processes.<br />
<br />
== Creating an R cluster ==<br />
<br />
The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.<br />
<br />
=== Creating your Rscript wrapper ===<br />
<br />
The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:<br />
<pre><br />
#!/bin/bash<br />
<br />
module load intel/2019u4 r/3.6.1<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore "$@"<br />
</pre><br />
The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.<br />
<br />
Once you've created your wrapper, make it executable:<br />
<pre><br />
$ chmod u+x MyRscript.sh<br />
</pre><br />
Your wrapper is now ready to be used.<br />
<br />
=== The cluster R code ===<br />
The R code which we will run consists of two parts: the head-node code, which launches the cluster and does the pre- and post-analysis, and the worker code, which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.<br />
<pre><br />
<br />
######################################################<br />
#<br />
# worker code<br />
#<br />
<br />
# first define the function which will be run on all the cluster nodes. This is just a test function. <br />
# Put your real worker code here.<br />
testfunc <- function(a) {<br />
<br />
  # this part is just to waste time<br />
  b <- 0<br />
  for (i in 1:10000) {<br />
    b <- b + 1<br />
  }<br />
<br />
  s <- Sys.info()['nodename']<br />
  return(paste0(s, " ", a[1], " ", a[2]))<br />
<br />
}<br />
<br />
<br />
######################################################<br />
#<br />
# head node code<br />
#<br />
<br />
# Create a bunch of index pairs to feed to the worker function. These could be parameters,<br />
# or whatever your code needs to vary across jobs. Note that the worker function only <br />
# takes a single argument; each entry in the list must contain all the information <br />
# that the function needs to run. In this example, each entry contains a list which<br />
# contains two pieces of information, a pair of indices.<br />
indexlist <- list()<br />
index <- 1<br />
for (i in 1:10) {<br />
  for (j in 1:10) {<br />
    indexlist[index] <- list(c(i,j))<br />
    index <- index + 1<br />
  }<br />
}<br />
<br />
<br />
# Now set up the cluster.<br />
<br />
# First load the parallel library.<br />
library(parallel)<br />
<br />
# Next find all the nodes which the scheduler has given to us.<br />
# These are given by the SLURM_JOB_NODELIST environment variable.<br />
nodelist <- Sys.getenv("SLURM_JOB_NODELIST")<br />
<br />
node_ids <- unlist(strsplit(nodelist,split="[^a-z0-9-]"))[-1]<br />
<br />
if (length(node_ids) > 0) {<br />
  expanded_ids <- lapply(node_ids, function (id) {<br />
    ranges <- as.numeric(<br />
      unlist(strsplit(id, split="[-]"))<br />
    )<br />
    if (length(ranges) > 1) seq(ranges[1], ranges[2], by=1) else ranges<br />
  })<br />
<br />
  nodelist <- sprintf("nia%04d", unlist(expanded_ids))<br />
}<br />
<br />
# Now launch the cluster, using the list of nodes and our Rscript<br />
# wrapper.<br />
cl <- makePSOCKcluster(names = nodelist, rscript = "/path/to/your/MyRscript.sh")<br />
<br />
# Now run the worker code, using the parameter list we created above.<br />
result <- clusterApplyLB(cl, indexlist, testfunc)<br />
<br />
# The results of all the jobs will now be put in the 'result' variable,<br />
# in the order they were specified in the 'indexlist' variable.<br />
<br />
# Don't forget to stop the cluster when you're finished.<br />
stopCluster(cl)<br />
</pre><br />
You can, of course, add any post-processing code you need to the above code.<br />
<br />
=== Submitting an R cluster job ===<br />
You are now ready to submit your job to the Niagara queue. The submission script is like most others:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --time=5:00:00<br />
#SBATCH --job-name MyRCluster<br />
<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
module load intel/2018.2 r/3.5.0<br />
<br />
${SCINET_R_ROOT}/bin/Rscript --no-restore MyClusterCode.R<br />
</pre><br />
Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.<br />
<br />
== SciNet's R Classes ==<br />
<br />
There is a dizzying amount of documentation available for programming in R; consult your favourite search engine. That being said, SciNet runs several classes each year on using R for research:<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=msc1090&include=all&filter=Filter MSC1090]: Introduction to Computational BioStatistics with R. This graduate-level [https://ims.utoronto.ca IMS]-sponsored class is open to graduate students in IMS and other fields, and is intended for those with little-to-no programming experience who wish to use R in scientific research.<br />
* [https://support.scinet.utoronto.ca/education/browse.php?category=-1&search=ees1137&include=all&filter=Filter EES1137]: Quantitative Applications for Data Analysis. [https://www.utsc.utoronto.ca/gradpes/ees1137h-quantitative-applications-data-analysis This class] is similar to MSC1090, but takes place at UTSC and is sponsored by the Department of Physical and Environmental Sciences.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=GROMACS&diff=2848
GROMACS
2020-10-30T13:30:29Z
<p>Ejspence: /* GROMACS 2016.5 */</p>
<hr />
<div>===GROMACS 2018.6===<br />
<br />
[http://www.gromacs.org GROMACS] is a versatile molecular dynamics package. A thorough treatment of GROMACS can be found on the [https://docs.computecanada.ca/wiki/GROMACS Compute Canada page]. Here is a sample Niagara run script:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
#<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=10<br />
#SBATCH --cpus-per-task=4<br />
#SBATCH --time=11:00:00<br />
#SBATCH --job-name test<br />
<br />
module load intel/2019u3<br />
module load intelmpi/2019u3<br />
module load gromacs/2018.6<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"<br />
<br />
srun gmx_mpi mdrun -deffnm md<br />
</source><br />
<br />
The above script requests 1 node and runs GROMACS in hybrid MPI/OpenMP mode, with 10 MPI tasks of 4 threads each (10 &times; 4 = 40 cores, the full node).<br />
<br />
Note that GROMACS is well-suited to use on GPUs, of which Niagara has none. Running on [[Mist]] is recommended. Alternatively, running on CC systems which have GPUs, such as [https://docs.computecanada.ca/wiki/Graham Graham] and [https://docs.computecanada.ca/wiki/Cedar Cedar], is also an option.<br />
<br />
=== Gromacs on Mist GPU cluster ===<br />
See details on Mist page: [https://docs.scinet.utoronto.ca/index.php/Mist#Gromacs Gromacs on Mist]</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Running_Serial_Jobs_on_Niagara&diff=2818
Running Serial Jobs on Niagara
2020-09-03T12:09:41Z
<p>Ejspence: /* Version for more than 1 node at once */</p>
<hr />
<div>===General considerations===<br />
<br />
====Use whole nodes...====<br />
<br />
When you submit a job to Niagara, it is run on one (or more than one) entire node - meaning that your job is occupying at least 40 processors for the duration of its run. The SciNet systems are usually fully utilized, with many researchers waiting in the queue for computational resources, so we require that you make full use of the nodes that your job is allocated, so other researchers don't have to wait unnecessarily, and so that your jobs get as much work done as possible.<br />
<br />
Often, the best way to make full use of the node is to run one large parallel computation; but sometimes it is beneficial to run several serial codes at the same time. On this page, we discuss ways to run suites of serial computations at once, as efficiently as possible, using the full resources of the node.<br />
<br />
====... memory permitting====<br />
<br />
When running multiple serial jobs on the same node, it is essential to have a good idea of how much memory the jobs will require. The Niagara compute nodes have about 200GB of memory available to user jobs running on the 40 cores, i.e., a bit over 4GB per core. So the jobs also have to be bunched in ways that will fit into 200GB. If they use more than this, it will crash the node, inconveniencing you and other researchers waiting for that node.<br />
<br />
If 40 serial jobs would not fit within the 200GB limit -- i.e. each individual job requires significantly in excess of ~4GB -- then it is acceptable to run fewer jobs per node so that they do fit. Note that in that case the jobs are likely candidates for parallelization; you can contact us at [mailto:support@scinet.utoronto.ca <support@scinet.utoronto.ca>] and arrange a meeting with one of the technical analysts to help you with that.<br />
<br />
If the memory requirements allow it, you could actually run more than 40 jobs at the same time, up to 80, exploiting the [[Niagara_Quickstart#Hyperthreading:_Logical_CPUs_vs._cores | HyperThreading]] feature of the Intel CPUs. It may seem counter-intuitive, but for certain types of tasks, running 80 simultaneous jobs on 40 cores has increased some users' overall throughput.<br />
<br />
====Is your job really serial?====<br />
<br />
While your program may not be explicitly parallel, it may use some of Niagara's threaded libraries for numerical computations, which can make use of multiple processors. In particular, Niagara's [[Python]] and [[R_Statistical_Package | R]] modules are compiled with aggressive optimization and using threaded numerical libraries which by default will make use of multiple cores for computations such as large matrix operations. This can greatly speed up individual runs, but by less (usually much less) than a factor of 40. If you do have many such threaded computations to do, you often get more calculations done per unit time if you turn off the threading and run multiple such computations at once (provided that fits in memory, as explained above). You can turn off threading of these libraries with the shell script line <tt>export OMP_NUM_THREADS=1</tt>; that line will be included in the scripts below. <br />
<br />
If your calculations implicitly use threading, you may want to experiment to see what gives you the best performance - you may find that running 4 (or even 8) jobs with 10 threads each (<tt>OMP_NUM_THREADS=10</tt>), or 2 jobs with 20 threads, gives better performance than 40 jobs with 1 thread (and almost certainly better than 1 job with 40 threads). We'd encourage you to perform exactly such a scaling test to find the combination of number of threads per process and processes per job that maximizes your throughput; for a small up-front investment in time you may significantly speed up all the computations you need to do.<br />
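As an illustrative sketch only (the <tt>doserialjob*</tt> scripts are the same placeholders used in the examples below), the body of such a job script, running 4 simultaneous computations with 10 threads each, might look like:<br />
<source lang="bash"><br />
# Give each computation 10 threads; 4 simultaneous subjobs x 10 threads = 40 cores.<br />
export OMP_NUM_THREADS=10<br />
<br />
(cd serialjobdir01 && ./doserialjob01) &<br />
(cd serialjobdir02 && ./doserialjob02) &<br />
(cd serialjobdir03 && ./doserialjob03) &<br />
(cd serialjobdir04 && ./doserialjob04) &<br />
<br />
# Wait for all four background subjobs to finish before the script (and the job) ends.<br />
wait<br />
</source><br />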
<br />
===Serial jobs of similar duration===<br />
<br />
The most straightforward way to run multiple serial jobs is to bunch the serial jobs in groups of 40 or more that will take roughly the same amount of time, and create a job script that looks a <br />
bit like this<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for multiple serial jobs on Niagara<br />
#<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name serialx40<br />
<br />
# Turn off implicit threading in Python, R<br />
export OMP_NUM_THREADS=1<br />
<br />
# EXECUTION COMMAND; ampersand off 40 jobs and wait<br />
(cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished") &<br />
(cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished") &<br />
(cd serialjobdir03 && ./doserialjob03 && echo "job 03 finished") &<br />
(cd serialjobdir04 && ./doserialjob04 && echo "job 04 finished") &<br />
(cd serialjobdir05 && ./doserialjob05 && echo "job 05 finished") &<br />
(cd serialjobdir06 && ./doserialjob06 && echo "job 06 finished") &<br />
(cd serialjobdir07 && ./doserialjob07 && echo "job 07 finished") &<br />
(cd serialjobdir08 && ./doserialjob08 && echo "job 08 finished") &<br />
(cd serialjobdir09 && ./doserialjob09 && echo "job 09 finished") &<br />
(cd serialjobdir10 && ./doserialjob10 && echo "job 10 finished") &<br />
(cd serialjobdir11 && ./doserialjob11 && echo "job 11 finished") &<br />
(cd serialjobdir12 && ./doserialjob12 && echo "job 12 finished") &<br />
(cd serialjobdir13 && ./doserialjob13 && echo "job 13 finished") &<br />
(cd serialjobdir14 && ./doserialjob14 && echo "job 14 finished") &<br />
(cd serialjobdir15 && ./doserialjob15 && echo "job 15 finished") &<br />
(cd serialjobdir16 && ./doserialjob16 && echo "job 16 finished") &<br />
(cd serialjobdir17 && ./doserialjob17 && echo "job 17 finished") &<br />
(cd serialjobdir18 && ./doserialjob18 && echo "job 18 finished") &<br />
(cd serialjobdir19 && ./doserialjob19 && echo "job 19 finished") &<br />
(cd serialjobdir20 && ./doserialjob20 && echo "job 20 finished") &<br />
(cd serialjobdir21 && ./doserialjob21 && echo "job 21 finished") &<br />
(cd serialjobdir22 && ./doserialjob22 && echo "job 22 finished") &<br />
(cd serialjobdir23 && ./doserialjob23 && echo "job 23 finished") &<br />
(cd serialjobdir24 && ./doserialjob24 && echo "job 24 finished") &<br />
(cd serialjobdir25 && ./doserialjob25 && echo "job 25 finished") &<br />
(cd serialjobdir26 && ./doserialjob26 && echo "job 26 finished") &<br />
(cd serialjobdir27 && ./doserialjob27 && echo "job 27 finished") &<br />
(cd serialjobdir28 && ./doserialjob28 && echo "job 28 finished") &<br />
(cd serialjobdir29 && ./doserialjob29 && echo "job 29 finished") &<br />
(cd serialjobdir30 && ./doserialjob30 && echo "job 30 finished") &<br />
(cd serialjobdir31 && ./doserialjob31 && echo "job 31 finished") &<br />
(cd serialjobdir32 && ./doserialjob32 && echo "job 32 finished") &<br />
(cd serialjobdir33 && ./doserialjob33 && echo "job 33 finished") &<br />
(cd serialjobdir34 && ./doserialjob34 && echo "job 34 finished") &<br />
(cd serialjobdir35 && ./doserialjob35 && echo "job 35 finished") &<br />
(cd serialjobdir36 && ./doserialjob36 && echo "job 36 finished") &<br />
(cd serialjobdir37 && ./doserialjob37 && echo "job 37 finished") &<br />
(cd serialjobdir38 && ./doserialjob38 && echo "job 38 finished") &<br />
(cd serialjobdir39 && ./doserialjob39 && echo "job 39 finished") &<br />
(cd serialjobdir40 && ./doserialjob40 && echo "job 40 finished") &<br />
wait<br />
</source><br />
<br />
There are four important things to take note of here. First, the <tt>'''wait'''</tt><br />
command at the end is crucial; without it the job will terminate <br />
immediately, killing the 40 programs you just started.<br />
<br />
Second is that every serial job is running in its own directory; this is important because writing to the same directory from different processes can lead to slowdowns because of directory locking. How badly your job suffers from this depends on how much I/O your serial jobs are doing, but with 40 jobs on a node, it can quickly add up.<br />
<br />
Third is that it is important to group the programs by how long they <br />
will take. If (say) <tt>doserialjob08</tt> takes 2 hours and the rest only take 1, <br />
then for one hour 39 of the 40 cores on that Niagara node are wasted; they are <br />
sitting idle but are unavailable for other users, and the utilization of <br />
this node over the whole run is only 51%. This is the sort of thing <br />
we'll notice, and users who don't make efficient use of the machine will <br />
have their ability to use Niagara resources reduced. If you have many serial jobs of varying length, <br />
use the submission script to balance the computational load, as explained [[ #Serial jobs of varying duration | below]].<br />
<br />
Fourth, if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs.<br />
<br />
Finally, writing out 80 cases (or even just 40, as in the above example) can become highly tedious, as can keeping track of all these subjobs. You should consider using a tool that automates this, like:<br />
<br />
===GNU Parallel===<br />
<br />
GNU parallel is a really nice tool written by Ole Tange to run multiple serial jobs in<br />
parallel. It allows you to keep the processors on each 40-core node busy, if you provide enough jobs to do.<br />
<br />
GNU parallel is accessible on Niagara in the module<br />
<tt>gnu-parallel</tt>:<br />
<source lang="bash"><br />
module load NiaEnv/2019b gnu-parallel<br />
</source><br />
This also switches to the newer NiaEnv/2019b stack. The current version of the GNU parallel module in that stack is 20191122. In the older stack, NiaEnv/2018a (which is loaded by default), the version of GNU parallel is 20180322. <br />
<br />
The command <tt>man parallel_tutorial</tt> shows much of GNU parallel's functionality, while <tt>man parallel</tt> gives the details of its syntax.<br />
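For a quick feel for the syntax, here is a toy example you could try on a login node (the <tt>-k</tt> flag just keeps the output in input order):<br />
<source lang="bash"><br />
# Run 'echo' once for each argument listed after :::, substituting it for {}.<br />
parallel -k echo "processing item {}" ::: a b c<br />
# prints: "processing item a", "processing item b", "processing item c"<br />
</source><br />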
<br />
The citation for GNU Parallel is: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.<br />
<br />
It is easiest to demonstrate the usage of GNU parallel by<br />
examples. First, suppose you have 80 jobs to do (similar to the above case), that these jobs' durations vary quite a bit, but that the average job duration is around 5 hours. You could use the following script (but don't, see below):<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for multiple serial jobs on Niagara<br />
#<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=12:00:00<br />
#SBATCH --job-name gnu-parallel-example<br />
<br />
# Turn off implicit threading in Python, R<br />
export OMP_NUM_THREADS=1<br />
<br />
module load NiaEnv/2019b gnu-parallel<br />
<br />
# EXECUTION COMMAND - DON'T USE THIS ONE<br />
parallel -j $SLURM_TASKS_PER_NODE <<EOF<br />
cd serialjobdir01 && ./doserialjob01 && echo "job 01 finished"<br />
cd serialjobdir02 && ./doserialjob02 && echo "job 02 finished"<br />
...<br />
cd serialjobdir80 && ./doserialjob80 && echo "job 80 finished"<br />
EOF<br />
</source><br />
<br />
The <tt>-j $SLURM_TASKS_PER_NODE</tt> parameter sets the number of subjobs to run at the same time on each compute node, using the SLURM value, which coincides with the <tt>--ntasks-per-node</tt> parameter. For gnu-parallel modules starting from version 20191122, if you omit the option <tt>-j $SLURM_TASKS_PER_NODE</tt>, you will get as many simultaneous subjobs as the <tt>--ntasks-per-node</tt> parameter you specify in the <tt>#SBATCH</tt> part of the job script.<br />
<br />
Each line in the input given to parallel is a separate subjob, so 80 jobs are lined up to run. Initially, 40 subjobs are given to the 40 processors on the node. When one of the processors is done with its assigned subjob, it will get a next subjob instead of sitting idle until the other processors are done. While you would expect that on average this script should take 10 hours (each processor on average has to complete two jobs of 5 hours), there's a good chance that one of the processors gets two jobs that take more than 5 hours, so the job script requests 12 hours to be safe. How much more time you should ask for in practice depends on the spread in expected run times of the separate jobs.<br />
<br />
===Serial jobs of varying duration===<br />
<br />
The script above works, and can be extended to more subjobs, which is especially important if you have to do a lot (100+) of relatively short serial runs '''whose walltimes vary'''. But it gets tedious to write out all the cases. You could write a script to automate this, but you do not have to, because GNU Parallel already has ways of generating subjobs, as we will show below.<br />
<br />
GNU Parallel can also keep track of which subjobs succeeded, failed, or never started. For that, you just add <tt>--joblog</tt> to the parallel command, followed by a filename to which to write the status:<br />
<br />
<source lang="bash" line start=17><br />
# EXECUTION COMMAND - DON'T USE THIS ONE<br />
parallel --joblog slurm-$SLURM_JOBID.log -j $SLURM_TASKS_PER_NODE <<EOF<br />
cd serialjobdir01 && ./doserialjob01<br />
cd serialjobdir02 && ./doserialjob02<br />
...<br />
cd serialjobdir80 && ./doserialjob80<br />
EOF<br />
</source><br />
<br />
In this case, the job log gets written to "slurm-$SLURM_JOBID.log", where "<tt>$SLURM_JOBID</tt>" will be replaced by the job number. The joblog can also be used to retry failed jobs (more below).<br />
<br />
Second, we can generate that set of subjobs instead of writing them out by hand. The following does the trick:<br />
<br />
<source lang="bash" line start=17><br />
# EXECUTION COMMAND <br />
parallel --joblog slurm-$SLURM_JOBID.log -j $SLURM_TASKS_PER_NODE "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}<br />
</source><br />
<br />
This works as follows: <tt>"cd serialjobdir{} && ./doserialjob{}"</tt> is a template command, with placeholders {}. <tt>:::</tt> indicates that a set of parameters follows that are to be substituted into the template, thus generating the command for each subjob. After the <tt>:::</tt> we can place a space-separated set of arguments, which in this case are generated using the bash-specific construct for a range, <tt>{01..80}</tt>.<br />
<br />
The final script now looks like this:<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for multiple serial jobs on Niagara<br />
#<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=12:00:00<br />
#SBATCH --job-name gnu-parallel-example<br />
<br />
# Turn off implicit threading in Python, R<br />
export OMP_NUM_THREADS=1<br />
<br />
module load NiaEnv/2019b gnu-parallel <br />
<br />
# EXECUTION COMMAND <br />
parallel --joblog slurm-$SLURM_JOBID.log "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}<br />
</source><br />
<br />
Notes:<br />
* As before, GNU Parallel keeps 40 jobs running at a time, and if one finishes, starts the next. This is an easy way to do ''load balancing''.<br />
* The <tt>-j</tt> option was omitted, which works if using GNU Parallel module version 20191122 or higher. Otherwise, you need to add the <tt>-j $SLURM_TASKS_PER_NODE</tt> flag to the parallel command. <br />
* Doing many serial jobs often entails doing many disk reads and writes, which can be detrimental to the performance. In that case, running from the ramdisk may be an option. <br />
** When using a ramdisk, make sure you copy your results from the ramdisk back to the scratch after the runs, or when the job is killed because time has run out.<br />
** More details on how to setup your script to use the ramdisk can be found on the [[User_Ramdisk | Ramdisk page]].<br />
* This script optimizes resource utilization, but can only use 1 node (40 cores) at a time. The next section addresses how to use more nodes.<br />
* On the command line, the option <tt>--bar</tt> can be nice for watching progress; when running as a job, however, you would not see this status bar.<br />
* The <tt>--joblog</tt> parameter also keeps track of failed or unfinished subjobs, so you can later try to redo those with the same command, but with the option <tt>--resume</tt> added, as sketched below.<br />
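A minimal sketch of such a retry (the job ID in the log file name is just a placeholder from an earlier run):<br />
<source lang="bash"><br />
# Re-run the same parallel command; --resume consults the joblog and continues with<br />
# subjobs that have not yet run (use --resume-failed to also retry the failed ones).<br />
parallel --resume --joblog slurm-1234567.log "cd serialjobdir{} && ./doserialjob{}" ::: {01..80}<br />
</source><br />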
<br />
===Version for more than 1 node at once===<br />
<br />
If you have many hundreds of serial jobs that you want to run concurrently and the nodes are available, then the approach above, while useful, would require tens of scripts to be submitted separately. Alternatively, it is possible to request more than one node and to use the following routine to distribute your processes amongst the cores.<br />
<br />
Although it is not recommended to use GNU parallel modules before version 20191122, if you do, the script should look like this:<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for multiple serial jobs on multiple Niagara nodes<br />
#<br />
#SBATCH --nodes=4<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=12:00:00<br />
#SBATCH --job-name gnu-parallel-multinode-example<br />
<br />
# Turn off implicit threading in Python, R<br />
export OMP_NUM_THREADS=1<br />
<br />
module load gnu-parallel<br />
<br />
HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)<br />
NCORES=40<br />
<br />
parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j $NCORES -S $HOSTS --wd $PWD "cd serialjobdir{} && ./doserialjob{}" ::: {001..800}<br />
<br />
</source><br />
<br />
* The parameter <tt>-S $HOSTS</tt> divides the work over different nodes. <tt>$HOSTS</tt> should be a comma separated list of the node names. These node names are also stored in <tt>$SLURM_NODELIST</tt>, but with a syntax that allows for ranges, which GNU parallel does not understand. The <tt>scontrol</tt> command in the script above fixes that.<br />
* Alternatively, GNU Parallel can be passed a file with the list of nodes to which to ssh, using <tt>--sshloginfile</tt>, but your jobs script would first have to create that file.<br />
* The parameter <tt>-j $NCORES</tt> tells <tt>parallel</tt> to run 40 subjobs simultaneously on each of the nodes (note: do not use the similarly named variable $SLURM_TASKS_PER_NODE as its format is incompatible with GNU parallel).<br />
* The parameter <tt>--wd $PWD</tt> sets the working directory on the other nodes to the working directory on the first node. <span style="color:red;">The <tt>--wd</tt> argument is essential:</span> without this, the run tries to start from the wrong place and will most likely fail.<br />
* If you need an environment variable to be transferred from the job script to the remotely running subjobs, use the <tt>--env ENVIRONMENTVARIABLE</tt> argument for the parallel command. The example above copies the most common variables that a remote command may need.<br />
<br />
Instead of this script, which uses an old version of GNU parallel, we recommend using the GNU parallel modules starting from version 20191122 that are available in NiaEnv/2019b, <br />
which facilitate automatic distribution of subjobs over nodes. For these newer versions of the module, the script can look like this:<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for multiple serial jobs on multiple Niagara nodes<br />
#<br />
#SBATCH --nodes=4<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=12:00:00<br />
#SBATCH --job-name gnu-parallel-multinode-example<br />
<br />
# Turn off implicit threading in Python, R<br />
export OMP_NUM_THREADS=1<br />
<br />
module load NiaEnv/2019b gnu-parallel<br />
<br />
parallel --joblog slurm-$SLURM_JOBID.log --wd $PWD "cd serialjobdir{} && ./doserialjob{}" ::: {001..800}<br />
<br />
</source><br />
* The automatic detection of the number of tasks per node and of the node names that GNU Parallel can use works through the environment variable <tt>$PARALLEL</tt>, which is set by the gnu-parallel module.<br />
* The parameter <tt>--wd $PWD</tt> sets the working directory on the other nodes to the working directory on the first node. <span style="color:red;">The <tt>--wd</tt> argument is essential:</span> without this, the run tries to start from the wrong place and will most likely fail.<br />
* If you need an environment variable to be transferred from the job script to the remotely running subjobs, use the <tt>--env ENVIRONMENTVARIABLE</tt> argument for the parallel command. The <tt>$PARALLEL</tt> environment variable is already set to copy the most common variables <tt>$PATH, $LD_LIBRARY_PATH, and $OMP_NUM_THREADS</tt>.<br />
<br />
Of course, this is just an example of what you could do with gnu parallel. How you set up your specific run depends on how each of the runs would be started. One could for instance also prepare a file of commands to run and make that the input to parallel as well.<br />
<br />
Submitting several bunches to single nodes, as in the section above, is a more fail-safe way of proceeding, since a node failure would only affect one of these bunches, rather than all runs. <br />
<br />
We reiterate that if memory requirements allow it, you should try to run more than 40 jobs at once, with a maximum of 80 jobs. The way the above example job scripts are written, you simply change <tt>#SBATCH --ntasks-per-node=40</tt> to <tt>#SBATCH --ntasks-per-node=80</tt> to accomplish this.<br />
<br />
===More on GNU parallel=== <br />
* The documentation for GNU parallel can be found at http://www.gnu.org/software/parallel/ .<br />
* After loading the <tt>gnu-parallel</tt> module, type <tt>man parallel_tutorial</tt><br />
* After loading the <tt>gnu-parallel</tt> module, type <tt>man parallel</tt><br/>The man page can also be found at http://www.gnu.org/software/parallel/man.html .<br />
<br />
===GNU Parallel Reference===<br />
<br />
The author of GNU parallel requests that, when using GNU parallel for a publication, you cite:<br />
<br />
* O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.</div>
Ejspence
https://docs.scinet.utoronto.ca/index.php?title=Parallel_Debugging_with_DDT&diff=2811
Parallel Debugging with DDT
2020-09-01T01:06:46Z
<p>Ejspence: /* Setting up the client side */</p>
<hr />
<div>==ARM DDT Parallel Debugger==<br />
<br />
For parallel debugging, SciNet has DDT ("Distributed Debugging Tool") installed on all our clusters. DDT is a powerful, GUI-based commercial debugger by ARM (formerly by Allinea). It supports the programming languages C, C++, and Fortran, and the parallel programming paradigms MPI, OpenMP, and CUDA. DDT can also be very useful for serial programs. DDT provides a nice, intuitive graphical user interface. It does need graphics support, so make sure to use the '-X' or '-Y' arguments to your ssh commands, so that X11 graphics can find its way back to your screen ("X forwarding").<br />
<br />
The most recent version of DDT installed on [[Niagara_Quickstart | Niagara]] is 19.1. The DDT license allows up to a total of 128 processes to be debugged simultaneously (shared among all users).<br />
<br />
To use ddt, ssh in with X forwarding enabled, load your usual compiler and mpi modules, compile your code with '-g' and load the module<br />
<br />
<code>module load ddt</code><br />
<br />
You can then start ddt with one of the following commands:<br />
<br />
<code>ddt</code><br />
<br />
<code>ddt <executable compiled with -g flag> </code><br />
<br />
<code>ddt <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>ddt -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
The first time you run DDT, it will set up configuration files. It puts these in the hidden directory $SCRATCH/.allinea. <br />
<br />
Note that most users will debug on the login nodes of the clusters (nia-login0{1-3,5-7}), but this is only appropriate if the number of MPI processes and threads is small, and the memory usage is not too large. If your debugging requires more resources, you should run it through the queue. On Niagara, an interactive debug session will suit most debugging purposes.<br />
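For example, a typical (hypothetical) session for debugging a small MPI program on a login node might look like the following; the user name and source file name are just placeholders:<br />
<source lang="bash"><br />
# Log in with X forwarding so the DDT GUI can display on your screen.<br />
ssh -Y myuser@niagara.scinet.utoronto.ca<br />
<br />
# Load a compiler and MPI stack, compile with debugging symbols, and load ddt.<br />
module load intel/2018.2 openmpi/3.1.0 ddt<br />
mpicc -g -O0 mycode.c -o mycode<br />
<br />
# Start the debugger on 4 MPI processes.<br />
ddt -n 4 ./mycode<br />
</source><br />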
<br />
==ARM MAP Parallel Profiler==<br />
<br />
MAP is a parallel (MPI) performance analyser with a graphical interface. It is part of the same DDT module, so you need to load <tt>ddt</tt> to use MAP (together, DDT and MAP form the <i>ARM Forge</i> bundle).<br />
<br />
It has a similar job startup interface as DDT. <br />
<br />
To be more precise, MAP is a sampling profiler with adaptive sampling rates to keep the<br />
data volumes collected under control. Samples are aggregated at all levels to preserve key features of<br />
a run without drowning in data. A folding code and stack viewer allows you to zoom into time<br />
spent on individual lines and draw back to see the big picture across nests of routines. MAP measures memory usage, floating-point calculations, MPI usage, as well as I/O.<br />
<br />
The maximum number of MPI processes that our MAP license supports is 64 (shared simultaneously among all users).<br />
<br />
It supports both interactive and batch modes for gathering profile data.<br />
<br />
===Interactive profiling with MAP===<br />
<br />
Startup is much the same as for DDT:<br />
<br />
<code>map</code><br />
<br />
<code>map <executable compiled with -g flag> </code><br />
<br />
<code>map <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>map -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
After you have started the code and it has run to completion, MAP will show the results. It will also save these results in a file with the extension <tt>.map</tt>. This allows you to load the result again into the graphical user interface at a later time.<br />
<br />
===Non-interactive profiling with MAP===<br />
<br />
It is also possible to run map non-interactively by passing the <tt>-profile</tt> flag, e.g.<br />
<br />
<code>map -profile -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
For instance, this could be used in a job when it is launched with a jobscript like<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
module load ddt<br />
<br />
map -profile -n $SLURM_NTASKS ./mpi_example<br />
</source><br />
<br />
This will just create the <tt>.map</tt> file, which you could inspect after the job has finished with<br />
<br />
<code>map MAPFILE</code><br />
<br />
==Parallel Debugging and Profiling in an Interactive Session on Niagara==<br />
<br />
By requesting a job from the 'debug' partition on Niagara, you can have access to at most 4 nodes, i.e., a total of 160 physical cores (or 320 virtual cores, using hyper-threading), for your exclusive, interactive use. Starting from a Niagara login node, you would request a debug sessions with the following command:<br />
<br />
<code>debugjob <numberofnodes></code><br />
<br />
where <tt><numberofnodes></tt> is 1, 2, 3, or 4. The sessions will last 60, 45, 30, or 15 minutes, depending on the number of nodes requested.<br />
<br />
This command will get you a prompt on a compute node (or on the 'head' node if you've asked for more than one node). Reload any modules that your application needs (e.g. <tt>module load intel openmpi</tt>), as well as the <tt>ddt</tt> module.<br />
<br />
Note that on compute nodes, $HOME is read-only, so unless your code is on $SCRATCH, you cannot recompile it (with '-g') in the debug session; this should have been done on a login node.<br />
<br />
If the time limits of these debugjobs are too restrictive, you need to request nodes from the regular queue. In that case, you want to make sure that you get [[Testing_With_Graphics|X11 graphics forwarded properly]].<br />
<br />
Within this debugjob session, you can then use the <tt>ddt</tt> and <tt>map</tt> commands.<br />
<br />
==Setting up a client-server connection==<br />
<br />
If you're working from home, or any other location where there isn't a fast internet connection, it is likely to be advantageous to run DDT or MAP in client-server mode. This keeps the bulk of the computation on Niagara or Mist (the server), while sending only the minimum amount of information over the internet to your locally-running version of DDT (the client).<br />
<br />
===Setting up the server side===<br />
<br />
The first step is to connect to Niagara (or Mist), and start a debug session<br />
<br />
ejspence@nia-login01 $ debugjob -N 1<br />
debugjob: Requesting 1 node(s) with 40 core(s) for 60 minutes and 0 seconds<br />
SALLOC: Granted job allocation 3995470<br />
SALLOC: Waiting for resource configuration<br />
SALLOC: Nodes nia0003 are ready for job<br />
ejspence@nia0003 $<br />
<br />
This will start an interactive debug session, on a single node, for an hour. Be sure to note the node which you have been allocated (nia0003 in this case).<br />
<br />
The next step is to determine the path to DDT. To do this you will need to load the DDT module:<br />
<br />
ejspence@nia0003 $ module load NiaEnv/2019b<br />
ejspence@nia0003 $ module load ddt/19.1<br />
ejspence@nia0003 $<br />
ejspence@nia0003 $ echo $SCINET_DDT_ROOT<br />
/scinet/niagara/software/2019b/opt/base/ddt/19.1<br />
ejspence@nia0003 $<br />
<br />
The next step is to create a startup script which will be run by the server, in case you are running on multiple nodes:<br />
<br />
#!/bin/bash<br />
module purge<br />
module load NiaEnv/2019b<br />
module load gcc/8.3.0 openmpi/4.0.1 ddt/19.1<br />
export ARM_TOOLS_CONFIG_DIR=${SCRATCH}/.arm<br />
mkdir -p ${ARM_TOOLS_CONFIG_DIR}<br />
export OMPI_MCA_pml=ob1<br />
<br />
Be sure to load whatever modules your code needs to run. Let us assume that the PATH to this script is $SCRATCH/ddt_remote_setup.sh.<br />
<br />
This completes the setup of the server side. There is no need to launch the server, the client itself will do this.<br />
<br />
===Setting up the client side===<br />
<br />
You now need to setup the client on your local machine (desktop or laptop). The first step is to go to [https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge/older-versions-of-remote-client-for-arm-forge this page] to download the Arm Forge client. Note that this page is for older versions of DDT. This is because the client and the server must be running the same version of DDT, and the version on Niagara is 19.1. Download the version of the client appropriate for your local machine, and install it.<br />
<br />
Now launch Arm Forge. You will see a screen similar to the one below. Select "Remote Launch", then "Configure".<br />
<br />
{| align="center"<br />
| [[File:DDT openning.png|480px|]]<br />
|}<br />
<br />
You will see that there are no sessions already configured. Click on "Add" to create a new session configuration.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions.png|480px|]]<br />
|}<br />
<br />
Next, fill in the details of the session. You need to fill in <br />
* the name of the session,<br />
* the host name, consisting of<br />
** your login credentials for Niagara (or Mist),<br />
** a space,<br />
** your user name and the node you are using (nia0003 in this example),<br />
* the installation directory of DDT on Niagara,<br />
* the location of your startup script.<br />
<br />
{| align="center"<br />
| [[File:DDT settings.png|540px|]]<br />
|}<br />
<br />
After you've entered the settings, click on "OK". This should bring you to the screen seen below.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions2.png|480px|]]<br />
|}<br />
<br />
The opening screen should now look like the one below.<br />
<br />
{| align="center"<br />
| [[File:DDT openning2.png|480px|]]<br />
|}<br />
<br />
Click on the session you'd like to launch. In this example, "DDT Test". This will bring you to DDT's launch screen, which is the same you'll see when you run DDT normally. Note that the code and files you will be testing must be hosted on Niagara, not on your local machine.</div>
Ejspence