https://docs.scinet.utoronto.ca/api.php?action=feedcontributions&user=Northrup&feedformat=atom
SciNet Users Documentation - User contributions [en] (2024-03-28T08:12:07Z, MediaWiki 1.35.12)

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3233 Main Page 2021-09-27T21:04:52Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>Mon Sep 27 16:11 EDT 2021 </b> HPSS is back online.<br />
<br />
<b>Wed Sep 23 17:23 EDT 2021 </b> Systems are being brought back online. HPSS may remain down for a few more days. <br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3232 Main Page 2021-09-27T21:04:28Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up |Globus |Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>Mon Sep 27 16:11 EDT 2021 </b> HPSS is back online.<br />
<br />
<b>Wed Sep 23 17:23 EDT 2021 </b> Systems are being brought back online. HPSS may remain down for a few more days. <br />
<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Previous_messages&diff=3231 Previous messages 2021-09-27T21:04:16Z
<p>Northrup: </p>
<hr />
<div><br />
<b>Wed Sep 23 12:30 EDT 2021 </b> Cooling restored. Systems should be available later this afternoon. <br />
<br />
<b>Wed Sep 23 9:30 EDT 2021 </b> Technicians on site working on cooling system. <br />
<br />
<b>Wed Sep 23 3:30 EDT 2021 </b> Cooling system issues still unresolved. <br />
<br />
<b>Wed Sep 22 23:27:48 EDT 2021 </b> Shutdown of the datacenter due to a problem with the cooling system.<br />
<br />
<b>Wed Sep 22 09:30 EDT 2021 </b>: File system issues, resolved.<br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<b>July 24, 2021, 6:00 PM EDT:</b> There appear to be file system issues, which may affect users' ability to login. We are investigating.<br />
<br />
<b> July 23rd, 2021, 9:00 AM EDT:</b> <b> Security update: </b> Due to a severe vulnerability in the Linux kernel (CVE-2021-33909), our team is currently patching and rebooting all login nodes and compute nodes, as well as the JupyterHub. There should be no effect on running jobs; however, sessions on login and datamover nodes will be disrupted. <br />
<br />
<b> July 20th, 2021, 7:00 PM EDT:</b> <b> SLURM configuration</b> - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
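For illustration, a minimal job-script sketch of where that option would go (the resource requests and the executable name ./my_app are placeholders):<br />
<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# As noted above, --no-kill recovers the previous default behaviour, so the
# whole job step is not killed when one task exits with a non-zero code.
srun --no-kill ./my_app
</pre>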
<br />
<b> July 20th, 2021, 7:00 PM EDT:</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 29th, 2021, 2:00 PM:</b> Thunderstorm-related power fluctuations are causing some Niagara compute nodes and their jobs to crash. Please resubmit if your jobs seem to have crashed for no apparent reason.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>June 28th, 2021, 9:00 AM:</b> Mist is under maintenance. OS upgrading from RHEL 7 to 8.<br />
<br />
<b>June 11th, 2021, 8:30 AM:</b> Maintenance complete. Systems are up.<br />
<br />
<b>June 9th to 10th, 2021:</b> The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, Rouge, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown starting at 7 AM EDT on Wednesday June 9th. We expect the systems to be back up on the morning of Friday June 11th. Check here for updates.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
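For illustration only, a sketch of such a tunnel (the license server hostname and port below are placeholders, not real values):<br />
<pre>
# Forward a local port through nia-gw to a license server; replace the
# hostname and port with the ones your software actually requires.
ssh -N -L 27000:license-server.example.com:27000 nia-gw
</pre>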
<br />
May 27th, 20:00. All systems are up and running <br />
<br />
May 27th, 19:30. Most systems are up<br />
<br />
May 27th, 19:00: Cooling is back. Powering up systems<br />
<br />
May 27th, 2021, 11:30am: The cooling tower issue has been identified as a wiring issue and is being repaired. We don't have an ETA on when cooling will be restored; however, we are hopeful it will be by the end of the day. <br />
<br />
May 27th, 2021, 12:30am: The cooling tower motor is not working properly and may need to be replaced. It's the primary motor, and the cooling system cannot run without it, so all equipment at the datacenter will remain unavailable at least until tomorrow. Updates about expected repair times will be posted when they are known.<br />
<br />
May 26th, 2021, 9:20pm: we are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
From Tue Mar 30 at 12 noon EST to Thu Apr 1 at 12 noon EST, there will be a two-day reservation for the "Niagara at Scale" pilot event. During these 48 hours, only "Niagara at Scale" projects will run on the compute nodes (as well as SOSCIP projects, on a subset of nodes). All other users can still log in, access their data, and submit jobs throughout this event, but the jobs will not run until after the event. The debugjob queue will remain available to everyone as well.<br />
<br />
The scheduler will not start batch jobs that cannot finish before the start of this event. Users who submit small, short jobs can take advantage of this, as the scheduler may be able to fit these jobs onto the otherwise idle nodes before the event starts.<br />
<br />
Tue 23 Mar 2021 12:19:07 PM EDT - Planned external network maintenance 12pm-1pm Tuesday, March 23rd. <br />
<br />
Thu Jan 28 17:35:16 EST 2021: <b> HPSS services are back online</b> <br />
<!--* we have not been able to secure a maintenance window today. Will try tomorrow or over the weekend.--><br />
<br />
Thu Jan 28 12:36:21 EST 2021: <b> HPSS services offline</b><br />
* We need a small maintenance window as early as possible this afternoon to perform a small configuration change. Ongoing jobs will be allowed to finish, but we are keeping new submissions on hold in the queue.<br />
<br />
Mon Jan 25 13:16:33 EST 2021: <b> HPSS services are back online</b> <br />
<br />
Sat Jan 23 10:03:33 EST 2021: <b> HPSS services offline</b> <br />
* We detected some type of hardware failure on our HPSS equipment overnight, so access has been disabled pending further investigation.<br />
<br />
Fri Jan 22 10:49:29 EST 2021: <b> The Globus transition to oauth is finished</b> <br />
* Please deactivate any previous sessions to the niagara endpoint (in the last 7 days), and activate/login again. <br />
* For more details check https://docs.scinet.utoronto.ca/index.php/Globus#computecandada.23niagara<br />
<br />
Jan 21, 2021: <b>Globus access disruption on Fri, Jan/22/2021 10AM:</b> Please be advised that we will have a maintenance window starting tomorrow at 10AM to roll out the transition of services to oauth based authentication.<br />
<br />
Jan 15, 2021: <b>Globus access update on Mon, Jan/18/2021 and Tue, Jan/19/2021:</b> <br />
<p>Please be advised that we will start preparations on Monday to perform an update to Globus access on Tuesday. We'll be adopting oauth instead of myproxy from that point on. During this period, expect sporadic disruptions of service. We will block access to nia-dm2 as early as Monday, so please refrain from starting new login sessions or ssh tunnels via nia-dm2 from this weekend onward.<br />
<br />
<b> December 11, 2020, 12:00 AM EST: </b> Cooling issue resolved. Systems back.<br />
<br />
<b> December 11, 2020, 6:00 PM EST: </b> Cooling issue at datacenter. All systems down.<br />
<br />
<b> December 7, 2020, 7:25 PM EST: </b>All systems back; users can log in again.<br />
<br />
<b> December 7, 2020, 6:46 PM EST: </b>User connectivity to data center not yet ready, but queued jobs on Mist and Niagara have been started.<br />
<br />
<b> December 7, 2020, 7:00 AM EST: </b>Maintenance shutdown in effect. This is a one-day maintenance shutdown. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online this evening.<br />
<br />
<b> December 2, 2020, 9:10 PM EST: </b>Power is back, systems are coming up. Please resubmit any jobs that failed because of this incident.<br />
<br />
<b> December 2, 2020, 6:00 PM EST: </b>Power glitch at the data center, caused about half of the compute nodes to go down. Power issue not yet resolved.<br />
<br />
<b> <span style="color:#dd1111">Announcing a Maintenance Shutdown on December 7th, 2020</span></b> <br/>There will be a one-day maintenance shutdown on December 7th 2020, starting at 7 am EST. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online in the evening of the same day.<br />
<br />
<b> November 6, 2020, 8:00 PM EST: </b> Systems are coming back online.<br />
<br />
<b> November 6, 2020, 9:49 AM EST: </b> Repairs on the cooling system are underway. No ETA, but the systems will likely be back some time today.<br />
<br />
<b> November 6, 2020, 4:27 AM EST: </b>Cooling system failure, datacentre is shut down.<br />
<br />
<b> October 9, 2020, 12:57 PM: </b> A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
<b> October 8, 2020, 9:50 PM: </b> Jupyterhub service is back up.<br />
<br />
<b> October 8, 2020, 5:40 PM: </b> Jupyterhub service is down. We are investigating.<br />
<br />
<b> September 28, 2020, 11:00 AM EST: </b> A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
<b> September 1, 2020, 2:15 PM EST: </b> A short power glitch caused about half of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.<br />
<br />
<b> September 1, 2020, 9:27 AM EST: </b> The Niagara cluster has moved to a new default software stack, NiaEnv/2019b. If your job scripts used the previous default software stack (NiaEnv/2018a), please put the command "module load NiaEnv/2018a" before any other module commands in those scripts to ensure they continue to work, or try the new stack (recommended).<br />
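For illustration, a sketch of a job script that pins the old stack (the module loaded after NiaEnv/2018a and the executable name are placeholders):<br />
<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# Load the old software stack first, so that later module commands resolve
# against NiaEnv/2018a instead of the new default, NiaEnv/2019b.
module load NiaEnv/2018a
module load intel        # placeholder for whatever modules the script loaded before
srun ./my_app            # placeholder executable
</pre>
<br />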
<b> August 24, 2020, 7:37 PM EST: </b> Connectivity is back to normal<br />
<br />
<b> August 24, 2020, 6:35 PM EST: </b> We have partial connectivity back, but are still investigating.<br />
<br />
<b> August 24, 2020, 3:15 PM EST: </b> There are issues connecting to the data centre. We're investigating.<br />
<br />
<b> August 21, 2020, 6:00 PM EST: </b> The pump has been repaired, cooling is restored, systems are up. <br/>Scratch purging is postponed until the evening of Friday Aug 28th, 2020.<br />
<br />
<b>August 19, 2020, 4:40 PM EST:</b> Update: The current estimate is to have the cooling restored on Friday and we hope to have the systems available for users on Saturday August 22, 2020.<br />
<br />
<b>August 17, 2020, 4:00 PM EST:</b> Unfortunately, after taking the pump apart, it was determined there was a more serious failure of the main drive shaft, not just the seal. As a new one will need to be sourced or fabricated, we're estimating that it will take at least a few more days to get the part and repairs done to restore cooling. Sorry for the inconvenience. <br />
<br />
<b>August 15, 2020, 1:00 PM EST:</b> Due to parts availability for repairing the failed pump and cooling system, it is unlikely that systems can be restored until Monday afternoon at the earliest. <br />
<br />
<b>August 15, 2020, 12:04 AM EST:</b> A primary pump seal in the cooling infrastructure has blown, and parts availability cannot be determined until tomorrow. All systems are shut down as there is no cooling. If parts are available, systems may be back late tomorrow at the earliest. Check here for updates. <br />
<br />
<b>August 14, 2020, 9:04 PM EST:</b> Tomorrow's /scratch purge has been postponed.<br />
<br />
<b>August 14, 2020, 9:00 PM EST:</b> Staff are at the datacenter. It looks like one of the pumps has a seal that is leaking badly.<br />
<br />
<b>August 14, 2020, 8:37 PM EST:</b> We seem to be undergoing a thermal shutdown at the datacenter.<br />
<br />
<b>August 14, 2020, 8:20 PM EST:</b> Network problems to niagara/mist. We are investigating.<br />
<br />
<b>August 13, 2020, 10:40 AM EST:</b> Network is fixed, scheduler and other services are back.<br />
<br />
<b>August 13, 2020, 8:20 AM EST:</b> We had an IB switch failure, which is affecting a subset of nodes, including the scheduler nodes.<br />
<br />
<b>August 10, 2020, 7:30 PM EST:</b> Scheduler fully operational again.<br />
<br />
<b>August 10, 2020, 3:00 PM EST:</b> Scheduler partially functional: jobs can be submitted and are running.<br />
<br />
<b>August 10, 2020, 2:00 PM EST:</b> Scheduler is temporarily not operational.<br />
<br />
<b>August 7, 2020, 9:15 PM EST:</b> Network is fixed, scheduler and other services are coming back.<br />
<br />
<b>August 7, 2020, 8:20 PM EST:</b> Disruption of part of the network in the data centre. This causes issues with the scheduler, the Mist login node, and possibly others. We are investigating.<br />
<br />
<b>July 30, 2020, 9:00 AM</b> Project backup in progress but incomplete: please be aware that after we deployed the new, larger storage appliance for scratch and project two months ago, we started a full backup of project (1.5PB). This backup is taking a while to complete, and there are still a few areas which have not been backed up fully. Please be careful not to delete things from project that you still need, in particular recently added material.<br />
<br />
<b>July 27, 2020, 5:00 PM:</b> Scheduler issues resolved.<br />
<br />
<b>July 27, 2020, 3:00 PM:</b> Scheduler issues. We are investigating.<br />
<br />
<b>July 13, 4:40 PM:</b> Most systems are available again. Only Mist is still being brought up.<br />
<br />
<b>July 13, 10:00 AM:</b> '''SciNet/Niagara Downtime In Progress'''<br />
<br />
<b>SciNet/Niagara Downtime Announcement, July 13, 2020</b><br/><br />
All resources at SciNet will undergo a maintenance shutdown on Monday July 13, 2020, starting at 10:00 am EDT, for file system and scheduler upgrades. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time.<br />
We expect to be able to bring the systems back around 3 PM (EST) on the same day.<br />
<br />
<b> June 29, 6:21:00 PM:</b> Systems are available again. <br />
<br />
<b> June 29, 12:30:00 PM:</b> Power Outage caused thermal shutdown.<br />
<br />
<b>June 20, 2020, 10:24 PM:</b> File systems are back up. Unfortunately, all running jobs would have died and users are asked to resubmit them.<br />
<br />
<b>June 20, 2020, 9:48 PM:</b> An issue with the file systems is causing trouble. We are investigating the cause.<br />
<br />
<b>June 15, 2020, 10:30 PM:</b> A <b>power glitch</b> caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>June 12, 2020, 6:15 PM:</b> Two <b>power glitches</b> during the night caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>June 6, 2020, 6:06 AM:</b> A <b>power glitch</b> caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>May 24, 2020, 8:20 AM:</b> A <b>power glitch</b> this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>May 7, 2020, 6:05 PM:</b> Maintenance shutdown is finished. Most systems are back in production.<br />
<br />
<b>May 6, 2020, 7:08 AM:</b> Two-day datacentre maintenance shutdown has started.<br />
<br />
<b> SciNet/Niagara Downtime Announcement, May 6-7, 2020</b><br />
<br />
All resources at SciNet will undergo a two-day maintenance shutdown on May 6th and 7th 2020, starting at 7 am EDT on Wednesday May 6th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) or systems hosted at the SciNet data centre. We expect to be able to bring the systems back online the evening of May 7th.<br />
<br />
<b>May 4, 2020, 7:51 AM:</b> A power glitch this morning caused compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>May 3, 2020, 8:20 AM:</b> A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.<br />
<br />
<b>April 28, 2020, 7:20 AM:</b> A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time have failed; users are asked to resubmit these jobs.<br />
<br />
<b>April 20, 2020: Security Incident at Cedar; implications for Niagara users</b><br />
<br />
Last week, it became evident that the Cedar GP cluster had been<br />
compromised for several weeks. The passwords of at least two<br />
Compute Canada users were known to the attackers. One of these was<br />
used to escalate privileges on Cedar, as explained on<br />
https://status.computecanada.ca/view_incident?incident=423.<br />
<br />
These accounts were used to log in to Niagara as well, but Niagara<br />
did not have the same security loophole as Cedar (which has been<br />
fixed), and no further escalation was observed on Niagara.<br />
<br />
Reassuring as that may sound, it is not known how the passwords of<br />
the two user accounts were obtained. Given this uncertainty, the<br />
SciNet team *strongly* recommends that you change your password on<br />
https://ccdb.computecanada.ca/security/change_password, and remove<br />
any SSH keys and regenerate new ones (see<br />
https://docs.scinet.utoronto.ca/index.php/SSH_keys).<br />
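For example, a new key pair could be generated along these lines (the key type and file name are only an example; follow the SSH_keys page linked above for the recommended procedure):<br />
<pre>
# On your own computer: create a fresh key pair protected by a passphrase,
# then authorize the new public key and delete the old key pair wherever it was installed.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_scinet
</pre>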
<br />
<b> Tue 30 Mar 2020 14:55:14 EDT</b> Burst Buffer available again.<br />
<br />
<b> Fri Mar 27 15:29:00 EDT 2020:</b> SciNet systems are back up. Only the Burst Buffer remains offline, its maintenance is expected to be finished early next week.<br />
<br />
<b> Thu Mar 26 23:05:00 EDT 2020:</b> Some aspects of the maintenance took longer than expected. The systems will not be back up until some time tomorrow, Friday March 27, 2020. <br />
<br />
<b> Wed Mar 25 7:00:00 EDT 2020:</b> SciNet/Niagara downtime started.<br />
<br />
<b> Mon Mar 23 18:45:10 EDT 2020:</b> File system issues were resolved.<br />
<br />
<b> Mon Mar 23 18:01:19 EDT 2020:</b> There is currently an issue with the main Niagara filesystems. This affects all systems; all jobs have been killed. The issue is being investigated. <br />
<br />
<b> Fri Mar 20 13:15:33 EDT 2020: </b> There was a power glitch at the datacentre at 8:50 AM, which resulted in jobs getting killed. Please resubmit failed jobs. <br />
<br />
<b> COVID-19 Impact on SciNet Operations, March 18, 2020</b><br />
<br />
Although the University of Toronto is closing some of its<br />
research operations on Friday March 20 at 5 pm EDT, this does not<br />
affect the SciNet systems (such as Niagara, Mist, and HPSS), which<br />
will remain operational.<br />
<br />
<b> SciNet/Niagara Downtime Announcement, March 25-26, 2020</b><br />
<br />
All resources at SciNet will undergo a two-day maintenance shutdown on March 25th and 26th 2020, starting at 7 am EDT on Wednesday March 25th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
This shutdown is necessary to finish the expansion of the Niagara cluster and its storage system.<br />
<br />
We expect to be able to bring the systems back online the evening of March 26th.<br />
<br />
<b> March 9, 2020, 11:24 PM:</b> HPSS services are temporarily suspended for emergency maintenance.<br />
<br />
<b> March 7, 2020, 10:15 PM:</b> File system issues have been cleared.<br />
<br />
<b> March 6, 2020, 7:30 PM:</b> File system issues; we are investigating<br />
<br />
<b> March 2, 2020, 1:30 PM:</b> For the extension of Niagara, the operating system on all Niagara nodes has been upgraded<br />
from CentOS 7.4 to 7.6. This required all<br />
nodes to be rebooted. Running compute jobs are allowed to finish<br />
before the compute node gets rebooted. Login nodes have all been rebooted, as have the datamover nodes and the jupyterhub service.<br />
<br />
<b> Feb 24, 2020, 1:30PM: </b> The [[Mist]] login node got rebooted. It is back, but we are still monitoring the situation.<br />
<br />
<b> Feb 12, 2020, 11:00AM: </b> The [[Mist]] GPU cluster now available to users.<br />
<br />
<b> Feb 11, 2020, 2:00PM: </b> The Niagara compute nodes were accidentally rebooted, killing all running jobs.<br />
<br />
<b> Feb 10, 2020, 7:00 PM: </b> HPSS is back to normal.<br />
<br />
<b> Jan 30, 2020, 12:01PM: </b> We are having an issue with HPSS, in which the disk-cache is full. We put a reservation on the whole system (Globus, plus archive and vfs queues), until it has had a chance to clear some space on the cache.<br />
<br />
<b> Jan 21, 2020, 4:05 PM: </b> There was a partial power outage that took down a large number of the compute nodes. If your job died during this period, please resubmit. <br />
<br />
<b>Jan 13, 2020, 7:35 PM:</b> Maintenance finished.<br />
<br />
<b>Jan 13, 2020, 8:20 AM:</b> The announced maintenance downtime started (see below).<br />
<br />
<b>Jan 9 2020, 11:30 AM:</b> External ssh connectivity restored, issue related to the university network.<br />
<br />
<b>Jan 9 2020, 9:24 AM:</b> We received reports of users having trouble connecting to the SciNet data centre; we're investigating. Systems are up and running and jobs are fine.<p><br />
As a workaround, in the meantime, it appears to be possible to log into graham, cedar or beluga, and then ssh to niagara.</p><br />
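A sketch of that workaround (the username is a placeholder; the hostnames are the usual Compute Canada login addresses):<br />
<pre>
# Hop through another Compute Canada cluster to reach Niagara:
ssh myuser@graham.computecanada.ca    # or cedar/beluga
ssh niagara.computecanada.ca          # then ssh on to Niagara from there
</pre>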
<br />
<b>Downtime announcement:</b><br />
To prepare for the upcoming expansion of Niagara, there will be a<br />
one-day maintenance shutdown on <b>January 13th 2020, starting at 8 am<br />
EST</b>. There will be no access to Niagara, Mist, HPSS or teach, nor<br />
to their file systems during this time.<br />
<br />
2019<br />
<br />
<b>December 13, 9:00 AM EST:</b> Issues resolved.<br />
<br />
<b>December 13, 8:20 AM EST:</b> Overnight issue is now preventing logins to Niagara and other services. Possibly a file system issue, we are investigating.<br />
<br />
<p> <b>Fri, Nov 15 2019, 11:00 PM (EST)</b> Niagara and most of the main systems are now available. <br />
</p><p> <b>Fri, Nov 15 2019, 7:50 PM (EST)</b> SOSCIP GPU cluster is up and accessible. Work on the other systems continues.<br />
</p><p> <b>Fri, Nov 15 2019, 5:00 PM (EST)</b> Infrastructure maintenance done, upgrades still in process.<br />
</p><p><br />
<b>Fri, Nov 15 2019, 7:00 AM (EST)</b> Maintenance shutdown of the SciNet data centre has started. Note: scratch purging has been postponed until Nov 17.<br/> <br />
</p><br />
<p><br />
<b>Announcement:</b> <br />
The SciNet datacentre will undergo a maintenance shutdown on<br />
Friday November 15th 2019, from 7 am to 11 pm (EST), with no access<br />
to any of the SciNet systems (Niagara, P8, SGC, HPSS, Teach cluster,<br />
or the filesystems) during that time. <br />
<br />
<br />
<b>Sat, Nov 2 2019, 1:30 PM (update):</b> Chiller has been fixed, all systems are operational. <br />
</p><br />
<b>Fri, Nov 1 2019, 4:30 PM (update):</b> We are operating in free cooling, so we have brought up about half of the Niagara compute nodes to reduce the cooling load. Access, storage, and other systems should now be available. <br />
<br />
<b>Fri, Nov 1 2019, 12:05 PM (update):</b> A power module in the chiller has failed and needs to be replaced. We should be able to operate in free cooling if the temperature stays cold enough, but we may not be able to run all systems. No ETA yet on when users will be able to log back in. <br />
<br />
<b>Fri, Nov 1 2019, 9:15 AM (update):</b> There was an automated shutdown because of rising temperatures, causing all systems to go down. We are investigating; check here for updates.<br />
<br />
<p><b>Fri, Nov 1 2019, 8:16 AM:</b> Unexpected data centre issue: Check here for updates.<br />
</p><br />
<br />
<b> Thu 1 Aug 2019 5:00:00 PM </b> Systems are up and operational. <br />
<br />
<b>Thu 1 Aug 2019 7:00:00 AM: </b> Scheduled Downtime Maintenance of the SciNet Datacenter. All systems will be down and unavailable starting 7am until the evening. <br />
<br />
<b>Fri 26 Jul 2019, 16:02:26 EDT:</b> There was an issue with the Burst Buffer at around 3PM, and it was recently solved. BB is OK again.<br />
<br />
<b> Sun 30 Jun 2019 </b> The <b>SOSCIP BGQ</b> and <b>P7</b> systems were decommissioned on <b>June 30th, 2019</b>. The BGQdev front end node and storage are still available. <br />
<br />
<b>Wed 19 Jun 2018, 1:20:00 PM:</b> The BGQ is back online.<br />
<br />
<b>Wed 19 Jun 2018, 10:00:00 AM:</b> The BGQ is still down, the SOSCIP GPU nodes should be back up. <br />
<br />
<b>Wed 19 Jun 2018, 1:40:00 AM:</b> There was an issue with the SOSCIP BGQ and GPU Cluster last night about 1:42am, probably a power fluctuation that took it down. <br />
<br />
<b>Wed 12 Jun 2019, 3:30 AM - 7:40 AM</b> Intermittent system issues on Niagara's project and scratch as the file number limit was reached. We increased the number of files allowed in total on the file system. <br />
<br />
<b>Thu 30 May 2019, 11:00:00 PM:</b><br />
The maintenance downtime of SciNet's data center has finished, and systems are being brought online now. You can check the progress here. Some systems might not be available until Friday morning.<br/><br />
Some action on the part of users will be required when they first connect again to a Niagara login node or datamover. This is due to the security upgrade of the Niagara cluster, which is now in line with currently accepted best practices.<br/><br />
The details of the required actions can be found on the [[SSH Changes in May 2019]] wiki page.<br />
<br />
<b>Wed 29-30 May 2019</b> The SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
'''SCHEDULED SHUTDOWN''': <br />
<br />
Please be advised that on '''Wednesday May 29th through Thursday May 30th''', the SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.<br />
<br />
This is necessary to finish the installation of an emergency power generator, to perform the annual cooling tower maintenance, and to enhance login security.<br />
<br />
We expect to be able to bring the systems back online the evening of May 30th. Due to the enhanced login security, users' ssh applications will need to update their known hosts lists. More detailed information on this procedure will be sent shortly before the systems are back online.<br />
<br />
Fri 5 Apr 2019: Software updates on Niagara: The default CCEnv software stack now uses avx512 on Niagara, and there is now a NiaEnv/2019b stack ("epoch"). <br />
<br />
Thu 4 Apr 2019: The 2019 compute and storage allocations have taken effect on Niagara.<br />
<br />
'''NOTE''': There is scheduled network maintenance for '''Friday April 26th 12am-8am''' on the SciNet datacenter external network connection. This will not affect internal connections and running jobs; however, remote connections may see interruptions during this period.<br />
<br />
<br />
Wed 24 Apr 2019 14:14 EDT: HPSS is back on service. Library and robot arm maintenance finished.<br />
<br />
Wed 24 Apr 2019 08:35 EDT: HPSS out of service this morning for library and robot arm maintenance.<br />
<br />
Fri 19 Apr 2019 17:40 EDT: HPSS robot arm has been released and is back to normal operations.<br />
<br />
Fri 19 Apr 2019 14:00 EDT: Problems with the HPSS library robot have been detected.<br />
<br />
Wed 17 Apr 2019 15:35 EDT: Network connection is back.<br />
<br />
Wed 17 Apr 2019 15:12 EDT: Network connection down. Investigating.<br />
<br />
Tue 9 Apr 2019 22:24:14 EDT: Network connection restored.<br />
<br />
Tue 9 Apr 2019, 15:20: Network connection down. Investigating.<br />
<br />
Fri 5 Apr 2019: Planned, short outage in connectivity to the SciNet datacentre from 7:30 am to 8:55 am EST for maintenance of the network. This outage will not affect running or queued jobs. It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.<br />
<br />
<br />
April 4, 2019: The 2019 compute and storage allocations will take effect on Niagara. Running jobs will not be affected by this change and will run their course. Queued jobs' priorities will be updated to reflect the new fairshare values later in the day. The queue should fully reflect the new fairshare values in about 24 hours. <br />
<br />
It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.<br />
<br />
There will be updates to the software stack on this day as well.<br />
<br />
March 25, 3:05 PM EST: Most systems back online, other services should be back shortly. <br />
<br />
March 25, 12:05 PM EST: Power is back at the datacentre, but it is not yet known when all systems will be back up. Keep checking here for updates.<br />
<br />
March 25, 11:27 AM EST: A power outage occurred in the datacentre and caused all services to go down. Check here for updates.<br />
<br />
<b>Thu Mar 21 10:37:28 EDT 2019:</b> HPSS is back in service<br />
<br />
HPSS out of service on '''Tue, Mar/19 at 9AM''', for tape library expansion and relocation. It's possible the downtime will extend to Wed, Mar/20.<br />
<br />
<b>January 21, 4:00 PM</b>: HPSS is back in service. Thank you for your patience.<br />
<br />
<b>January 18, 5:00 PM</b>: We completed practically all of the HPSS upgrades (software/hardware); however, the main client node, archive02, is presenting an issue we have not yet been able to resolve. We will try to resume work over the weekend with cool heads, or on Monday. Sorry, but this is an unforeseen delay. Jobs on the queue will remain there, and we'll delay the scratch purging by 1 week. <br><br />
<br />
<b>January 16, 11:00 PM</b>: HPSS is being upgraded, as announced. <br><br />
<br />
<b>January 16, 8:00 PM</b>: Systems are coming back up and should be accessible to users now.<br><br />
<br />
<b>January 15, 8:00 AM</b>: Data centre downtime in effect.<br><br />
<br />
* <font color=red>Downtime Announcement for January 15 and 16, 2019</font><br><br />
The SciNet datacentre will need to undergo a two-day maintenance shutdown in order to perform electrical work, repairs and maintenance. The electrical work is in preparation for the upcoming installation of an emergency power generator and a larger UPS, which will result in increased resilience to power glitches and outages. The shutdown is scheduled to start on <b>Tuesday January 15, 2019, at 7 am</b> and will last until <b>Wednesday January 16, 2019</b>, some time in the evening. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the filesystems) during this time.<br />
Check back here for up-to-date information on the status of the systems.<br />
<br />
Note: this downtime was originally scheduled for Dec. 18, 2018, but has been postponed and combined with the annual maintenance downtime.<br />
<br />
* December 24, 2018, 11:35 AM EST: Most systems are operational again. If you had compute jobs running yesterday at around 3:30PM, they likely crashed - please check them and resubmit if needed.<br />
<br />
* December 24, 2018, 10:40 AM EST: Repairs have been made, and the file systems are starting to be mounted on the cluster. <br />
<br />
* December 23, 2018, 3:38 PM EST: Issues with the file systems (home, scratch and project). We are investigating, it looks like a hardware issue that we are trying to work around. Note that the absence of /home means you cannot log in with ssh keys. All compute jobs crashed around 3:30 PM EST on Dec 23. Once the system is properly up again, please resubmit your jobs. Unfortunately, at this time of year, it is not possible to give an estimate on when the system will be operational again.<br />
<br />
* '''Tue Nov 22 14:20:00 EDT 2018''': <font color=green>HPSS back in service</font><br />
* '''Tue Nov 22 08:55:00 EDT 2018''': <font color=red>HPSS offline for scheduled maintenance</font><br />
* '''Tue Nov 20 16:30:00 EDT 2018''': HPSS offline on Thursday 9AM for installation of new LTO8 drives in the tape library.<br />
* '''Tue Oct 9 12:16:00 EDT 2018''': BGQ compute nodes are up. <br />
* '''Sun Oct 7 20:24:26 EDT 2018''': SGC and BGQ front end are available, BGQ compute nodes down related to a cooling issue. <br />
* '''Sat Oct 6 23:16:44 EDT 2018''': There were some problems bringing up SGC & BGQ, they will remain offline for now.<br />
* '''Sat Oct 6 18:36:35 EDT 2018''': Electrical work finished, power restored. Systems are coming online.<br />
* July 18, 2018: login.scinet.utoronto.ca is now disabled, GPC $SCRATCH and $HOME are decommissioned.<br />
* July 12, 2018: There was a short power interruption around 10:30 am which caused most of the systems (Niagara, SGC, BGQ) to reboot and any running jobs to fail. <br />
* July 11, 2018: P7s moved to the BGQ filesystem, P8s moved to the Niagara filesystem.<br />
* May 24, 2018, 9:25 PM EST: The data center is up, and all systems are operational again.<br />
* May 24, 2018, 7:00 AM EST: The data centre is under annual maintenance. All systems are offline. Systems are expected to be back late afternoon today; check for updates on this page.<br />
* May 18, 2018: Announcement: Annual scheduled maintenance downtime: Thursday May 24, starting 7:00 AM<br />
* May 16, 2018: Cooling restored, systems online<br />
* May 16, 2018: Cooling issue at datacentre again, all systems down<br />
* May 15, 2018: Cooling restored, systems coming online<br />
* May 15, 2018: Cooling issue at datacentre, all systems down<br />
* May 4, 2018: [[HPSS]] is now operational on Niagara.<br />
* May 3, 2018: [[Burst Buffer]] is available upon request.<br />
* May 3, 2018: The [https://docs.computecanada.ca/wiki/Globus Globus] endpoint for Niagara is available: computecanada#niagara.<br />
* May 1, 2018: System status moved here.<br />
* Apr 23, 2018 GPC-compute is decommissioned, GPC-storage available until 30 May 2018.<br />
* April 10, 2018: Niagara commissioned.</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3225 Main Page 2021-09-23T16:49:06Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down |Niagara|Niagara_Quickstart}}<br />
|{{Down |Mist|Mist}}<br />
|{{Down |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{Down |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>Wed Sep 23 12:30 EDT 2021 </b> Cooling restored. Systems should be available later this afternoon. <br />
<br />
<b>Wed Sep 23 9:30 EDT 2021 </b> Technicians on site working on cooling system. <br />
<br />
<b>Wed Sep 23 3:30 EDT 2021 </b> Cooling system issues still unresolved. <br />
<br />
<b>Wed Sep 22 23:27:48 EDT 2021 </b> Shutdown of the datacenter due to a problem with the cooling system.<br />
<br />
<b>Wed Sep 22 09:30 EDT 2021 </b>: File system issues, resolved.<br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3224 Main Page 2021-09-23T14:08:31Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down |Niagara|Niagara_Quickstart}}<br />
|{{Down |Mist|Mist}}<br />
|{{Down |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{Down |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>Wed Sep 23 9:30 EDT 2021 </b> Technicians on site working on cooling system. <br />
<br />
<b>Wed Sep 23 3:30 EDT 2021 </b> Cooling system issues still unresolved. <br />
<br />
<b>Wed Sep 22 23:27:48 EDT 2021 </b> Shutdown of the datacenter due to a problem with the cooling system.<br />
<br />
<b>Wed Sep 22 09:30 EDT 2021 </b>: File system issues, resolved.<br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3223 Main Page 2021-09-23T08:00:52Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down |Niagara|Niagara_Quickstart}}<br />
|{{Down |Mist|Mist}}<br />
|{{Down |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{Down |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Down |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>Wed Sep 23 3:30 EDT 2021 </b> Cooling system issues still unresolved. <br />
<br />
<b>Wed Sep 22 23:27:48 EDT 2021 </b> Shutdown of the datacenter due to a problem with the cooling system.<br />
<br />
<b>Wed Sep 22 09:30 EDT 2021 </b>: File system issues, resolved.<br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3221 Main Page 2021-09-22T17:35:46Z
<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
<b>Wed Sep 22 09:30 EDT 2021 </b>: File system issues, resolved.<br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>
Northrup

https://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3220 Main Page 2021-09-22T11:31:49Z
<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{Up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
<b>Wed Sep 22 07:30 EDT 2021 </b>: File system issues, investigating.<br />
<br />
<b>Sun Sep 19 10:00 EDT 2021</b>: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.<br />
<br />
<b>Wed Sep 15 17:35 EDT 2021</b>: filesystem issues resolved<br />
<br />
<b>Wed Sep 15 16:39 EDT 2021</b>: filesystem issues<br />
<br />
<b>Mon Sep 13 13:15:07 EDT 2021</b> HPSS is back online.<br />
<br />
<b>Fri Sep 10 17:57:23 EDT 2021</b> HPSS is offline due to unscheduled maintenance.<br />
<br />
<b>Wed Aug 18 16:13:42 EDT 2021</b> The HPSS upgrade is complete.<br />
<br />
<b>HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday):</b> We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://education.scinet.utoronto.ca SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Modules for Mist]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3161Main Page2021-07-23T14:04:24Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b> July 23rd, 2021, 9:00 AM:</b> <b>Security update:</b> Due to a severe vulnerability in the Linux kernel (CVE-2021-33909), our team is currently patching and rebooting all login nodes and compute nodes. There should be no effect on running jobs; however, sessions on login and datamover nodes will be disrupted. <br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> <b> SLURM configuration</b> - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
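<br />
As a rough illustration (the host name and port below are placeholders, not actual SciNet or license-server values), such a tunnel is typically set up with SSH local port forwarding:<br />
<pre><br />
# Hypothetical example: forward a local port through nia-gw to a license server.<br />
# Replace LICENSE_HOST and 27000 with your license server's real address and port.<br />
ssh -N -f -L 27000:LICENSE_HOST:27000 nia-gw<br />
# Point your software at localhost:27000; the license server will then see the<br />
# connection coming from the datamover addresses listed above.<br />
</pre><br />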
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3160Main Page2021-07-23T14:04:03Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Up |Rouge|Rouge}}<br />
|-<br />
|{{Up |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b> July 23rd, 2021, 8:00 AM:</b> <b>Security update:</b> Due to a severe vulnerability in the Linux kernel (CVE-2021-33909), our team is currently patching and rebooting all login nodes and compute nodes. There should be no effect on running jobs; however, sessions on login and datamover nodes will be disrupted. <br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> <b> SLURM configuration</b> - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3155Main Page2021-07-20T23:01:23Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b> July 20th, 2021, 7:00 PM :</b> <b> SLURM configuration</b> - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3154Main Page2021-07-20T23:00:53Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b> July 20th, 2021, 7:00 PM:</b> <b>SLURM configuration</b> - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3153Main Page2021-07-20T23:00:06Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>SLURM configuration:</b> Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM :</b> Maintenance finished, systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3152Main Page2021-07-20T22:59:49Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up |Niagara|Niagara_Quickstart}}<br />
|{{Up |Mist|Mist}}<br />
|{{Up |Teach|Teach}}<br />
|{{Down |Rouge|Rouge}}<br />
|-<br />
|{{Down |Jupyter Hub|Jupyter_Hub}}<br />
|{{up |Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up |File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up |Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down |HPSS|HPSS}}<br />
|{{Up |Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up |External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down |Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<b>SLURM configuration:</b> Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.<br />
<br />
<b> July 20th, 2021, 7:00 PM:</b> Maintenance finished; systems are back online. <br />
<br />
<b>SciNet Downtime July 20th, 2021 (Tuesday):</b> There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.<br />
<br />
<b>June 28th, 2021, 4:06 PM:</b> Mist OS upgrade is complete.<br />
<br />
<b>May 27, 2021:</b> Datamover addresses have changed to improve high-bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.<br />
<br />
If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.<br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3026Main Page2021-05-27T16:00:12Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down|Niagara|Niagara_Quickstart}}<br />
|{{Down|Mist|Mist}}<br />
|{{Down|Teach|Teach}}<br />
|{{Down|Rouge|Rouge}}<br />
|-<br />
|{{Down|Jupyter Hub|Jupyter_Hub}}<br />
|{{Down|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down|HPSS|HPSS}}<br />
|{{Down|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
May 27th, 2021, 11:30am: The cooling tower issue has been identified as a wiring issue and is being repaired. We don't have an ETA on when cooling will be restored; however, we are hopeful it will be restored by the end of the day. <br />
<br />
May 27th, 2021, 12:30am: The cooling tower motor is not working properly and may need to be replaced. It's the primary motor and the cooling system cannot run without it, so at least until tomorrow all equipment at the datacenter will remain unavailable. Updates about expected repair times will be posted when they are known.<br />
<br />
May 26th, 2021, 9:20pm: we are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
<!-- Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. --><br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3025Main Page2021-05-27T05:36:56Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down|Niagara|Niagara_Quickstart}}<br />
|{{Down|Mist|Mist}}<br />
|{{Down|Teach|Teach}}<br />
|{{Down|Rouge|Rouge}}<br />
|-<br />
|{{Down|Jupyter Hub|Jupyter_Hub}}<br />
|{{Down|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down|HPSS|HPSS}}<br />
|{{Down|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
May 27th, 2021, 12:30am: The cooling tower motor is not working properly and may need to be replaced. It's the primary motor and the cooling system cannot run without it, so at least until tomorrow all equipment at the datacenter will remain unavailable. Updates about expected repair times will be posted when they are known.<br />
<br />
May 26th, 2021, 9:20pm: we are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
<!-- Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. --><br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3024Main Page2021-05-27T05:36:27Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Down|Niagara|Niagara_Quickstart}}<br />
|{{Down|Mist|Mist}}<br />
|{{Down|Teach|Teach}}<br />
|{{Down|Rouge|Rouge}}<br />
|-<br />
|{{Down|Jupyter Hub|Jupyter_Hub}}<br />
|{{Down|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Down|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Down|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Down|HPSS|HPSS}}<br />
|{{Down|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Down|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
May 27th, 2021, 12:30am: The cooling tower motor is not working properly and may need to be replaced. It's the primary motor and the cooling system cannot run without it, so at least until tomorrow all equipment at the datacenter will remain unavailable. Updates about expected repair times will be posted when they are known.<br />
<br />
May 26th, 2021, 9:20pm: we are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.<br />
<br />
<!-- Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. --><br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3015Rouge2021-05-07T14:31:00Z<p>Northrup: /* Full-node job script */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH, and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
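<br />
As a minimal sketch of a typical session (the module names and versions are examples only; use <code>module avail</code> or <code>module spider</code> to see what is actually installed):<br />
<pre><br />
module list              # show what is currently loaded<br />
module load gcc rocm     # load a compiler and the ROCm toolkit (example names)<br />
echo $SCINET_ROCM_ROOT   # module root directory, assuming the SCINET_MODULENAME_ROOT pattern above<br />
module purge             # unload everything again<br />
</pre><br />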
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''.<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
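<br />
As a sketch of a simple GPU build (the source file name is a placeholder and the module names are assumptions):<br />
<pre><br />
# Load a host compiler and ROCm, as required above<br />
module load gcc rocm<br />
# Compile a HIP source file with hipcc, the ROCm compiler driver<br />
hipcc -O2 -o vector_add vector_add.hip.cpp<br />
# List the GPUs visible on the node<br />
rocm-smi<br />
</pre><br />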
<br />
===GNU Compilers ===<br />
<br />
Available compiler modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
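<br />
For example, building and test-running a small MPI program might look like the following (module and file names are placeholders; production runs should be launched through a SLURM job script):<br />
<pre><br />
module load gcc openmpi<br />
mpicc -O2 -o hello_mpi hello_mpi.c    # mpicc is OpenMPI's C compiler wrapper<br />
mpirun -np 4 ./hello_mpi              # small test run with 4 ranks<br />
</pre><br />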
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# Setting '-ntomp 4' might give better performance; do your own benchmark. It is not recommended to set it larger than 6 for a single-GPU job.<br />
# If you see the warning 'GPU update with domain decomposition lacks substantial testing and should be used with caution.', you can add '-update cpu' to override it.<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# Do not set the +p flag larger than 12; there are only 6 cores (12 threads) per single-GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre><br />
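To verify that the ROCm build of PyTorch actually sees a GPU, a quick check such as the following can be run inside a GPU job or debugjob session (it assumes the virtual environment created above):<br />
<pre><br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
# ROCm builds of PyTorch expose the GPUs through the torch.cuda API<br />
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"<br />
</pre><br />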
<br />
= Testing and debugging =<br />
<br />
You should test your code before you submit it to the cluster, to verify that it is correct and to determine what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run no more than a couple of minutes, take at most about 1-2 GB of memory, and use no more than one GPU and a few cores.<br />
<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob command:<br />
<br />
<pre><br />
rouge-login01:~$ debugjob --clean -g G<br />
</pre> <br />
<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you a full node with 8 GPUs for 30 minutes. The <tt>--clean</tt> argument is optional but recommended, as it starts the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Rouge login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on one of Rouge's 20 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Rouge uses SLURM as its job scheduler. <br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
<pre><br />
rouge-login01:scratch$ sbatch jobscript.sh<br />
</pre><br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by GPU, each with 6 CPU cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below).<br />
<br />
== Single-GPU job script ==<br />
A single-GPU job gets 1/8 of a node, which is 1 GPU + 6 of the 48 CPU cores/threads + ~64GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, you can set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:00<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
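<br />
Once submitted, the job can be monitored with standard SLURM commands, for example (the script name is a placeholder):<br />
<pre><br />
sbatch single_gpu_job.sh     # submit; prints the job ID<br />
squeue -u $USER              # check the state of your queued and running jobs<br />
scancel JOBID                # cancel a job if needed<br />
</pre><br />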
<br />
== Full-node job script ==<br />
'''If you are not sure whether your program can run on multiple GPUs, please follow the single-GPU job instructions above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU jobs should ask for a minimum of one full node (8 GPUs). Users need to specify the "compute_full_node" partition in order to get all the resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=8<br />
#SBATCH --ntasks=8 # this only affects MPI jobs<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3014Rouge2021-05-07T14:26:58Z<p>Northrup: /* Submitting jobs */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH, and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''.<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
<br />
===GNU Compilers ===<br />
<br />
Available compiler modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# Setting '-ntomp 4' might give better performance; do your own benchmark. It is not recommended to set it larger than 6 for a single-GPU job.<br />
# If you see the warning 'GPU update with domain decomposition lacks substantial testing and should be used with caution.', you can add '-update cpu' to override it.<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# Do not set the +p flag larger than 12; there are only 6 cores (12 threads) per single-GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre><br />
<br />
= Testing and debugging =<br />
<br />
You should test your code before you submit it to the cluster, to verify that it is correct and to determine what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run no more than a couple of minutes, take at most about 1-2 GB of memory, and use no more than one GPU and a few cores.<br />
<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the debugjob command:<br />
<br />
<pre><br />
rouge-login01:~$ debugjob --clean -g G<br />
</pre> <br />
<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you a full node with 8 GPUs for 30 minutes. The <tt>--clean</tt> argument is optional but recommended, as it starts the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Rouge login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on one of Rouge's 20 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Rouge uses SLURM as its job scheduler. <br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
<pre><br />
rouge-login01:scratch$ sbatch jobscript.sh<br />
</pre><br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by GPU, each with 6 CPU cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands of all the required modules (see examples below).<br />
<br />
== Single-GPU job script ==<br />
A single-GPU job gets 1/8 of a node, which is 1 GPU + 6 of the 48 CPU cores/threads + ~64GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, you can set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:00<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Full-node job script ==<br />
'''If you are not sure whether your program can run on multiple GPUs, please follow the single-GPU job instructions above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU jobs should ask for a minimum of one full node (8 GPUs). Users need to specify the "compute_full_node" partition in order to get all the resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=8<br />
#SBATCH --ntasks=8 # this only affects MPI jobs<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3013Rouge2021-05-07T14:24:59Z<p>Northrup: /* Submitting jobs */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
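<br />
For example, a typical sequence on the login node might look like this (the module names follow the SCINET_MODULENAME_ROOT convention described above; treat this as a sketch, not a required recipe):<br />
<pre><br />
rouge-login01:~$ module spider rocm        # find available rocm versions<br />
rouge-login01:~$ module load gcc rocm      # load the default versions<br />
rouge-login01:~$ module list               # confirm what is loaded<br />
rouge-login01:~$ ls $SCINET_ROCM_ROOT      # root directory of the loaded rocm module<br />
</pre><br />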
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
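<br />
As a sketch of building a HIP source file with this stack using <tt>hipcc</tt> (part of the ROCm toolkit; the file name and extra flag are only an example, and the exact architecture flag may vary between ROCm versions):<br />
<pre><br />
rouge-login01:~$ module load gcc rocm<br />
rouge-login01:~$ hipcc -O2 -o saxpy saxpy.hip    # e.g. add --amdgpu-target=gfx906 to target the MI50 explicitly<br />
</pre><br />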
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
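<br />
For example, to build an MPI program (file names are illustrative; anything beyond a tiny sanity check should be run through the scheduler rather than on the login node):<br />
<pre><br />
rouge-login01:~$ module load gcc openmpi<br />
rouge-login01:~$ mpicc -O2 -o hello_mpi hello_mpi.c<br />
rouge-login01:~$ mpirun -np 2 ./hello_mpi    # quick 2-rank sanity check only<br />
</pre><br />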
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre><br />
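<br />
To quickly check that the ROCm build of PyTorch can actually see a GPU (ROCm builds reuse PyTorch's <tt>cuda</tt> device API), you can run something like this inside a GPU job or debugjob session:<br />
<pre><br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"<br />
</pre><br />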
<br />
= Testing and debugging =<br />
<br />
You should test your code before you submit it to the cluster, both to check that it is correct and to find out what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, take at most about 1-2GB of memory, and use no more than one GPU and a few cores.<br />
<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the <tt>debugjob</tt> command:<br />
<br />
<pre><br />
rouge-login01:~$ debugjob --clean -g G<br />
</pre> <br />
<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you a full node with 8 GPUs for 30 minutes. The <tt>--clean</tt> argument is optional but recommended, as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Rouge login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on one of Rouge's 20 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Rouge uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (8 GPUs per node)'''. <br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
<pre><br />
rouge-login01:scratch$ sbatch jobscript.sh<br />
</pre><br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by GPU, each with 6 CPU cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).<br />
<br />
== Single-GPU job script ==<br />
For a single-GPU job, each job gets 1/8 of a node, which is 1 GPU + 6 of the 48 CPU cores/threads + ~64GB of CPU memory. '''Users should never request CPUs or memory explicitly.''' If running an MPI program, set --ntasks to the number of MPI ranks. '''Do NOT set --ntasks for non-MPI programs.''' <br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
#SBATCH --time=1:00:0<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre><br />
<br />
== Full-node job script ==<br />
'''If you are not sure the program can be executed on multiple GPUs, please follow the single-gpu job instruction above or contact SciNet/SOSCIP support.'''<br />
<br />
Multi-GPU jobs should request a minimum of one full node (8 GPUs). Users need to specify the "compute_full_node" partition in order to get all resources on a node. <br />
*An example for a 1-node job:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=8<br />
#SBATCH --ntasks=8 #this only affects MPI job<br />
#SBATCH --time=1:00:00<br />
#SBATCH -p compute_full_node<br />
<br />
<br />
module load <modules you need><br />
Run your program<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3012Rouge2021-05-07T14:19:31Z<p>Northrup: /* Testing and debugging */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre><br />
<br />
= Testing and debugging =<br />
<br />
You should test your code before you submit it to the cluster, both to check that it is correct and to find out what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, take at most about 1-2GB of memory, and use no more than one GPU and a few cores.<br />
<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the <tt>debugjob</tt> command:<br />
<br />
<pre><br />
rouge-login01:~$ debugjob --clean -g G<br />
</pre> <br />
<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you a full node with 8 GPUs for 30 minutes. The <tt>--clean</tt> argument is optional but recommended, as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Rouge login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on one of Rouge's 20 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Rouge uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (8 GPUs per node)'''. <br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
rouge-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by GPU, each with 6 CPU cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3011Rouge2021-05-07T14:14:10Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre><br />
<br />
= Testing and debugging =<br />
<br />
You should test your code before you submit it to the cluster, both to check that it is correct and to find out what kind of resources you need.<br />
* Small test jobs can be run on the login node. Rule of thumb: tests should run for no more than a couple of minutes, take at most about 1-2GB of memory, and use no more than one GPU and a few cores.<br />
<br />
* For short tests that do not fit on a login node, or for which you need a dedicated node, request an interactive debug job with the <tt>debugjob</tt> command:<br />
rouge-login01:~$ debugjob --clean -g G<br />
where G is the number of GPUs. If G=1, this gives an interactive session for 2 hours, whereas G=4 gets you a single node with 4 GPUs for 30 minutes, and G=8 (the maximum) gets you 2 nodes each with 4 GPUs for 30 minutes. The <tt>--clean</tt> argument is optional but recommended, as it will start the session without any modules loaded, thus mimicking more closely what happens when you submit a job script.<br />
<br />
= Submitting jobs =<br />
Once you have compiled and tested your code or workflow on the Rouge login nodes, and confirmed that it behaves correctly, you are ready to submit jobs to the cluster. Your jobs will run on one of Rouge's 20 compute nodes. When and where your job runs is determined by the scheduler.<br />
<br />
Rouge uses SLURM as its job scheduler. It is configured to allow only '''Single-GPU jobs''' and '''Full-node jobs (4 GPUs per node)'''.<br />
<br />
You submit jobs from a login node by passing a script to the sbatch command:<br />
<br />
rouge-login01:scratch$ sbatch jobscript.sh<br />
<br />
This puts the job in the queue. It will run on the compute nodes in due course. In most cases, you should not submit from your $HOME directory, but rather, from your $SCRATCH directory, so that the output of your compute job can be written out (as mentioned above, $HOME is read-only on the compute nodes).<br />
<br />
Example job scripts can be found below.<br />
Keep in mind:<br />
* Scheduling is by GPU, each with 6 CPU cores.<br />
* Your job's maximum walltime is 24 hours. <br />
* Jobs must write their output to your scratch or project directory (home is read-only on compute nodes).<br />
* Compute nodes have no internet access.<br />
* Your job script will not remember the modules you have loaded, so it needs to contain "module load" commands for all the required modules (see examples below).</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3010Main Page2021-05-07T13:49:18Z<p>Northrup: /* System Status */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up|Niagara|Niagara_Quickstart}}<br />
|{{Up|Mist|Mist}}<br />
|{{Up|Teach|Teach}}<br />
|{{Up|Rouge|Rouge}}<br />
|-<br />
|{{Up|Jupyter Hub|Jupyter_Hub}}<br />
|{{Up|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
Announcement: On April 26th, 2021 at 2:00 PM EST, the JupyterHub will be rebooted for an OS update. Please save any notebooks you may have running there.<br />
<br />
Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. <br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=3009Main Page2021-05-07T13:48:36Z<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up|Niagara|Niagara_Quickstart}}<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up|Mist|Mist}}<br />
|{{Up|Teach|Teach}}<br />
|{{Up|Rouge|Rouge}}<br />
|-<br />
|{{Up|Jupyter Hub|Jupyter_Hub}}<br />
|{{Up|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
Announcement: On April 26th, 2021 at 2:00 PM EST, the JupyterHub will be rebooted for an OS update. Please save any notebooks you may have running there.<br />
<br />
Announcement: On June 7th and 8th, 2021, The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown. <br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3008Rouge2021-05-07T13:45:38Z<p>Northrup: /* Available compilers and interpreters */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3007Rouge2021-05-07T13:45:28Z<p>Northrup: /* Available compilers and interpreters */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
The currently installed ROCm Toolkit is '''4.1.0'''<br />
<pre><br />
module load rocm/<version><br />
</pre><br />
*A compiler (GCC or rocm-clang) module must be loaded in order to use ROCm to build any code.<br />
<br />
The current AMD driver version is 5.9.15. Use '''rocm-smi -a''' for full details.<br />
<br />
===GNU Compilers ===<br />
<br />
Available GCC modules are:<br />
<pre><br />
gcc/10.3.0<br />
rocm-clang/4.1.0<br />
hipify-clang/12.0.0<br />
aocc/3.0.0<br />
</pre><br />
<br />
<br />
=== OpenMPI ===<br />
The <tt>openmpi/<version></tt> module is available with different compilers.<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3006Rouge2021-05-07T13:34:20Z<p>Northrup: /* Loading software modules */</p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed software directories, such as /include and /lib.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile mpi code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (better performance than OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# setting '-ntomp 4' might give better performance, do your own benchmark. not recommended to set larger than 6 for single GPU job<br />
# if you worry about 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' warning message (if there is any), add '-update cpu' to override<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# do not set +p flag larger than 12, there are only 6 cores (12 threads) per single GPU job.<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run PyTorch job with single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3005Rouge2021-05-07T13:32:25Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for internode communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Access and support requests should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
<!-- <br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
--><br />
<br />
<br />
Rouge login node '''rouge-login01''' can be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the software package, such as its include and lib subdirectories.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
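<br />
For example, a typical sequence on Rouge might look like the sketch below; the module names are illustrative, so check <code>module avail</code> for what is actually installed, and note that the exact name of the <code>SCINET_MODULENAME_ROOT</code> variable depends on the module:<br />
<pre><br />
module load rocm            # GPU software stack; load this first for GPU work<br />
module load openmpi         # only needed for MPI code<br />
module list                 # confirm what is loaded<br />
echo $SCINET_ROCM_ROOT      # illustrative: root directory set by the rocm module<br />
</pre><br />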
<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
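<br />
The ROCm stack provides the <tt>hipcc</tt> compiler driver for building HIP GPU code. As a minimal sketch (the source file name is hypothetical, and the module name should be confirmed with <code>module avail</code>):<br />
<pre><br />
module load rocm<br />
hipcc -O2 -o saxpy saxpy_hip.cpp   # saxpy_hip.cpp is a hypothetical HIP source file<br />
# run the resulting binary on a compute node with a GPU, e.g. through the scheduler<br />
</pre><br />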
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (which performs better than the OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# Setting '-ntomp 4' might give better performance; run your own benchmark. It is not recommended to set it larger than 6 for a single-GPU job.<br />
# If you see the warning 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' and are concerned about it, add '-update cpu' to fall back to updating on the CPU.<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# Do not set the +p flag larger than 12; a single-GPU job is allocated only 6 cores (12 hardware threads).<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run a PyTorch job with a single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=3004Rouge2021-05-07T13:28:33Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their [https://www.amd.com/en/corporate/hpc-fund#:~:text=The%20goal%20of%20the%20AMD,potential%20threats%20to%20global%20health COVID-19 HPC Fund ] support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
</pre><br />
Rouge login node '''rouge-login01''' can also be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the software package, such as its include and lib subdirectories.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==<br />
<pre><br />
/scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/lammps.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif<br />
/scinet/rouge/amd/containers/openmm.rocm401.ubuntu18.sif<br />
</pre><br />
<br />
== GROMACS ==<br />
The HIP version of GROMACS 2020.3 (which performs better than the OpenCL version) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env OMP_PLACES=cores /scinet/rouge/amd/containers/gromacs.rocm401.ubuntu18.sif gmx mdrun -pin off -ntmpi 1 -ntomp 6 ......<br />
<br />
# Setting '-ntomp 4' might give better performance; run your own benchmark. It is not recommended to set it larger than 6 for a single-GPU job.<br />
# If you see the warning 'GPU update with domain decomposition lacks substantial testing and should be used with caution.' and are concerned about it, add '-update cpu' to fall back to updating on the CPU.<br />
</pre><br />
<br />
== NAMD ==<br />
The HIP version of NAMD (3.0a) is provided by AMD in a container. Currently it is suggested to use a single GPU for all simulations.<br />
Job example:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
export SINGULARITY_HOME=$SLURM_SUBMIT_DIR<br />
<br />
singularity exec -B /home -B /scratch --env LD_LIBRARY_PATH=/opt/rocm/lib:/.singularity.d/libs /scinet/rouge/amd/containers/namd.rocm401.ubuntu18.sif namd2 +idlepoll +p 12 stmv.namd<br />
# Do not set the +p flag larger than 12; a single-GPU job is allocated only 6 cores (12 hardware threads).<br />
</pre><br />
<br />
== PyTorch ==<br />
Install PyTorch into a python virtual environment:<br />
<pre><br />
module load python gcc<br />
mkdir -p ~/.virtualenvs<br />
virtualenv --system-site-packages ~/.virtualenvs/pytorch-rocm<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
pip3 install torch -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html<br />
pip3 install ninja && pip3 install 'git+https://github.com/pytorch/vision.git@v0.9.1'<br />
</pre><br />
Run a PyTorch job with a single GPU:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=1:00:00<br />
#SBATCH --nodes=1<br />
#SBATCH --gpus-per-node=1<br />
<br />
module load python gcc<br />
source ~/.virtualenvs/pytorch-rocm/bin/activate<br />
python code.py<br />
</pre></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2987Rouge2021-04-16T15:30:01Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
= Getting started on Rouge =<br />
<br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
</pre><br />
Rouge login node '''rouge-login01''' can also be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the software package, such as its include and lib subdirectories.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2986Rouge2021-04-16T15:29:49Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Specifications=<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 support program. The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
<br />
= Getting started on Rouge =<br />
<br />
Rouge can be accessed directly.<br />
<pre><br />
ssh -Y MYCCUSERNAME@rouge.scinet.utoronto.ca<br />
</pre><br />
Rouge login node '''rouge-login01''' can also be accessed via the Niagara cluster.<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
ssh -Y rouge-login01<br />
</pre><br />
<br />
== Storage ==<br />
<br />
The filesystem for Rouge is currently shared with the Niagara cluster. See [https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Your_various_directories Niagara Storage] for more details.<br />
<br />
= Loading software modules =<br />
<br />
You have two options for running code on Rouge: use existing software, or compile your own. This section focuses on the former.<br />
<br />
Other than essentials, all installed software is made available [[Using_modules | using module commands]]. These modules set environment variables (PATH, etc.), allowing multiple, conflicting versions of a given package to be available. A detailed explanation of the module system can be [[Using_modules | found on the modules page]].<br />
<br />
Common module subcommands are:<br />
<br />
* <code>module load <module-name></code>: load the default version of a particular software.<br />
* <code>module load <module-name>/<module-version></code>: load a specific version of a particular software.<br />
* <code>module purge</code>: unload all currently loaded modules.<br />
* <code>module spider</code> (or <code>module spider <module-name></code>): list available software packages.<br />
* <code>module avail</code>: list loadable software packages.<br />
* <code>module list</code>: list loaded modules.<br />
<br />
Along with modifying common environment variables, such as PATH and LD_LIBRARY_PATH, these modules also create a SCINET_MODULENAME_ROOT environment variable, which can be used to access commonly needed directories of the software package, such as its include and lib subdirectories.<br />
<br />
There are handy abbreviations for the module commands. <code>ml</code> is the same as <code>module list</code>, and <code>ml <module-name></code> is the same as <code>module load <module-name></code>.<br />
<br />
<br />
= Available compilers and interpreters =<br />
<br />
* The <tt>Rocm</tt> module has to be loaded first for GPU software.<br />
* To compile MPI code, you must additionally load an <tt>openmpi</tt> module.<br />
<br />
=== ROCm ===<br />
<br />
= Software =<br />
<br />
== Singularity Containers ==</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2985Rouge2021-04-16T15:22:53Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd1.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Rouge Cluster =<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=File:Amd1.jpeg&diff=2984File:Amd1.jpeg2021-04-16T15:22:34Z<p>Northrup: </p>
<hr />
<div></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2983Rouge2021-04-16T15:21:23Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
= Rouge Cluster =<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2982Rouge2021-04-16T15:16:19Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd.jpeg|center|rotation:90|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
== Rouge Cluster ==<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2981Rouge2021-04-16T14:54:22Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[File:Amd.jpeg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
== Rouge Cluster ==<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=File:Amd.jpeg&diff=2980File:Amd.jpeg2021-04-16T14:52:03Z<p>Northrup: </p>
<hr />
<div></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2979Rouge2021-04-16T13:36:17Z<p>Northrup: </p>
<hr />
<div>{{Infobox Computer<br />
|image=[[Image:Ibm_idataplex_dx360_m4.jpg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=March 2021<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|gpuspernode=8 MI50-32GB<br />
|rampernode=512 GB<br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
== Rouge Cluster ==<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Rouge&diff=2978Rouge2021-04-16T13:33:51Z<p>Northrup: Created page with "{Infobox Computer |image=thumb |name=Rouge |installed=(orig Feb 2013), Oct 2018 |operatingsystem= Linux (Centos 7.6) |loginn..."</p>
<hr />
<div>{Infobox Computer<br />
|image=[[Image:Ibm_idataplex_dx360_m4.jpg|center|300px|thumb]] <br />
|name=Rouge<br />
|installed=(orig Feb 2013), Oct 2018<br />
|operatingsystem= Linux (Centos 7.6)<br />
|loginnode= rouge-login01<br />
|nnodes=20 <br />
|rampernode=512 Gb <br />
|corespernode=48 <br />
|interconnect=Infiniband (2xEDR)<br />
|vendorcompilers=rocm/gcc<br />
|queuetype=slurm<br />
}}<br />
<br />
== Rouge Cluster ==<br />
<br />
The Rouge cluster was donated to the University of Toronto by AMD as part of their COVID-19 program.<br />
<br />
Questions about its use or problems should be sent to '''support@scinet.utoronto.ca'''.<br />
<br />
== Specifications==<br />
<br />
The cluster consists of 20 x86_64 nodes each with a single AMD EPYC 7642 48-Core CPU running at 2.3GHz with 512GB of RAM and 8 Radeon Instinct MI50 GPUs per node.<br />
<br />
The nodes are interconnected with 2xHDR100 Infiniband for MPI communications and disk I/O to the SciNet Niagara filesystems. In total this cluster contains 960 CPU cores and 160 GPUs. <br />
<br />
== Login/Devel Node ==<br />
<br />
Login to '''rouge-login01'''</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=2966Main Page2021-03-23T16:22:54Z<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up|Niagara|Niagara_Quickstart}}<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up|Mist|Mist}}<br />
|{{Up|Teach|Teach}}<br />
|-<br />
|{{Up|Jupyter Hub|Jupyter_Hub}}<br />
|{{Up|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
Tue 23 Mar 2021 12:19:07 PM EDT - Planned external network maintenance 12pm-1pm Tuesday, March 23rd. <br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=2965Main Page2021-03-23T16:18:50Z<p>Northrup: </p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up|Niagara|Niagara_Quickstart}}<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up|Mist|Mist}}<br />
|{{Up|Teach|Teach}}<br />
|-<br />
|{{Up|Jupyter Hub|Jupyter_Hub}}<br />
|{{Up|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Down|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<br />
Tue 23 Mar 2021 12:19:07 PM EDT - Planned network maintenance from 12pm-1pm March 23rd. <br />
<br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Main_Page&diff=2964Main Page2021-03-18T05:59:51Z<p>Northrup: /* QuickStart Guides */</p>
<hr />
<div>__NOTOC__<br />
{| style="border-spacing:10px; width: 95%"<br />
| style="padding:1em; padding-top:.1em; border:2px solid #0645ad; background-color:#f6f6f6; border-radius:7px"|<br />
<br />
==System Status==<br />
<br />
<!-- Use "Up" or "Down"; these are templates. --><br />
{|style="width:100%" <br />
|{{Up|Niagara|Niagara_Quickstart}}<br />
|{{Up|HPSS|HPSS}}<br />
|{{Up|Mist|Mist}}<br />
|{{Up|Teach|Teach}}<br />
|-<br />
|{{Up|Jupyter Hub|Jupyter_Hub}}<br />
|{{Up|Scheduler|Niagara_Quickstart#Submitting_jobs}}<br />
|{{Up|File system|Niagara_Quickstart#Storage_and_quotas}}<br />
|{{Up|Burst Buffer|Burst_Buffer}}<br />
|-<br />
|{{Up|Login Nodes|Niagara_Quickstart#Logging_in}} <br />
|{{Up|External Network|Niagara_Quickstart#Logging_in}} <br />
|{{Up|Globus|Globus}}<br />
|}<br />
<br />
<!-- Current Messages: --><br />
<!-- When removing system status entries, please archive them to: https://docs.scinet.utoronto.ca/index.php/Previous_messages --><br />
{|style="border-spacing: 10px;width: 100%"<br />
|valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== QuickStart Guides ==<br />
* [[Niagara Quickstart]]<br />
* [[HPSS | HPSS archival storage]]<br />
* [[Mist| Mist Power 9 GPU cluster]]<br />
* [[Teach|Teach cluster]]<br />
* [[FAQ | FAQ (frequently asked questions)]]<br />
* [[Acknowledging SciNet]]<br />
| valign="top" style="margin: 1em; padding:1em; padding-top:.1em; border:2px solid #000; background-color:#fff; border-radius:7px; width: 49.5%" |<br />
<br />
== Tutorials, Manuals, etc. ==<br />
* [https://support.scinet.utoronto.ca/education/browse.php SciNet education material]<br />
* [https://www.youtube.com/c/SciNetHPCattheUniversityofToronto SciNet's YouTube channel]<br />
* [[Modules specific to Niagara|Software Modules specific to Niagara]] <br />
* [[Commercial software]]<br />
* [[Burst Buffer]]<br />
* [[SSH Tunneling]]<br />
* [[SSH#Two-Factor_authentication|Two-Factor Authentication]]<br />
* [[Visualization]]<br />
* [[Running Serial Jobs on Niagara]]<br />
* [[Jupyter Hub]]<br />
|}</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Parallel_Debugging_with_DDT&diff=2947Parallel Debugging with DDT2021-02-17T05:32:00Z<p>Northrup: </p>
<hr />
<div>==ARM DDT Parallel Debugger==<br />
<br />
For parallel debugging, SciNet has DDT ("Distributed Debugging Tool") installed on all our clusters. DDT is a powerful, GUI-based commercial debugger by ARM (formerly by Allinea). It supports the programming languages C, C++, and Fortran, and the parallel programming paradigms MPI, OpenMP, and CUDA. DDT can also be very useful for serial programs. DDT provides a nice, intuitive graphical user interface. It does need graphics support, so make sure to use the '-X' or '-Y' arguments to your ssh commands, so that X11 graphics can find its way back to your screen ("X forwarding").<br />
<br />
The most recent version of DDT installed on [[Niagara_Quickstart | Niagara]] is 20.1.3. The DDT license allows up to a total of 64 processes to be debugged simultaneously (shared among all users).<br />
<br />
To use DDT, ssh in with X forwarding enabled, load your usual compiler and MPI modules, compile your code with '-g', and load the module<br />
<br />
<code>module load ddt</code><br />
<br />
You can then start ddt with one of the following commands:<br />
<br />
<code>ddt</code><br />
<br />
<code>ddt <executable compiled with -g flag> </code><br />
<br />
<code>ddt <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>ddt -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
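<br />
For example, a complete session for debugging a small MPI program might look like the following sketch; the module versions (taken from the job script example further below) and the source file name are illustrative:<br />
<pre><br />
ssh -Y MYCCUSERNAME@niagara.scinet.utoronto.ca<br />
module load intel/2018.2 openmpi/3.1.0         # whatever your code is normally built with<br />
mpif90 -g -O0 -o mpi_example mpi_example.f90   # hypothetical source; '-g' keeps the debug symbols<br />
module load ddt<br />
ddt -n 4 ./mpi_example<br />
</pre><br />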
<br />
The first time you run DDT, it will set up configuration files. It puts these in the hidden directory $SCRATCH/.allinea. <br />
<br />
Note that most users will debug on the login nodes of the cluster (nia-login0{1-3,5-7}), but that this is only appropriate if the number of MPI processes and threads is small, and the memory usage is not too large. If your debugging requires more resources, you should run it through the queue. On Niagara, an interactive debug session will suit most debugging purposes.<br />
<br />
==ARM MAP Parallel Profiler==<br />
<br />
MAP is a parallel (MPI) performance analyser with a graphical interface. It is part of the same DDT module, so you need to load <tt>ddt</tt> to use MAP (together, DDT and MAP form the <i>ARM Forge</i> bundle).<br />
<br />
It has a job startup interface similar to that of DDT.<br />
<br />
To be more precise, MAP is a sampling profiler with adaptive sampling rates to keep the<br />
data volumes collected under control. Samples are aggregated at all levels to preserve key features of<br />
a run without drowning in data. A folding code and stack viewer allows you to zoom into time<br />
spent on individual lines and draw back to see the big picture across nests of routines. MAP measures memory usage, floating-point calculations, MPI usage, as well as I/O.<br />
<br />
The maximum number of MPI processes that our MAP license supports is 64 (shared simultaneously among all users).<br />
<br />
It supports both interactive and batch modes for gathering profile data.<br />
<br />
===Interactive profiling with MAP===<br />
<br />
Startup is much the same as for DDT:<br />
<br />
<code>map</code><br />
<br />
<code>map <executable compiled with -g flag> </code><br />
<br />
<code>map <executable compiled with -g flag> <arguments> </code><br />
<br />
<code>map -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
After you have started the code and it has run to completion, MAP will show the results. It will also save these results in a file with the extension <tt>.map</tt>. This allows you to load the result again into the graphical user interface at a later time.<br />
<br />
===Non-interactive profiling with MAP===<br />
<br />
It is also possible to run map non-interactively by passing the <tt>-profile</tt> flag, e.g.<br />
<br />
<code>map -profile -n <numprocs> <executable compiled with -g flag> <arguments> </code><br />
<br />
For instance, this could be used in a job when it is launched with a jobscript like<br />
<br />
<source lang="bash">#!/bin/bash <br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=1:00:00<br />
#SBATCH --job-name=mpi_job<br />
#SBATCH --output=mpi_output_%j.txt<br />
#SBATCH --mail-type=FAIL<br />
<br />
module load intel/2018.2<br />
module load openmpi/3.1.0<br />
module load ddt<br />
<br />
map -profile -n $SLURM_NTASKS ./mpi_example<br />
</source><br />
<br />
This will just create the <tt>.map</tt> file, which you could inspect after the job has finished with<br />
<br />
<code>map MAPFILE</code><br />
<br />
==Parallel Debugging and Profiling in an Interactive Session on Niagara==<br />
<br />
By requesting a job from the 'debug' partition on Niagara, you can have access to at most 4 nodes, i.e., a total of 160 physical cores (or 320 virtual cores, using hyper-threading), for your exclusive, interactive use. Starting from a Niagara login node, you would request a debug session with the following command:<br />
<br />
<code>debugjob <numberofnodes></code><br />
<br />
where <tt><numberofnodes></tt> is 1, 2, 3, or 4. The sessions will last 60, 45, 30, or 15 minutes, depending on the number of nodes requested.<br />
<br />
This command will get you a prompt on a compute node (or on the 'head' node if you've asked for more than one node). Reload any modules that your application needs (e.g. <tt>module load intel openmpi</tt>), as well as the <tt>ddt</tt> module.<br />
<br />
Note that on compute nodes, $HOME is read-only, so unless your code is on $SCRATCH, you cannot recompile it (with '-g') in the debug session; this should have been done on a login node.<br />
<br />
If the time restrictions of these debugjobs are too great, you need to request nodes from the regular queue. In that case, you want to make sure that you get [[Testing_With_Graphics|X11 graphics forwarded properly]].<br />
<br />
Within this debugjob session, you can then use the <tt>ddt</tt> and <tt>map</tt> commands.<br />
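<br />
Putting the pieces together, a typical interactive debugging session might look like the following sketch (the module versions and executable name are illustrative):<br />
<pre><br />
debugjob 1                                   # request one node for 60 minutes<br />
cd $SCRATCH/mycode                           # work from $SCRATCH; $HOME is read-only on compute nodes<br />
module load intel/2018.2 openmpi/3.1.0 ddt<br />
ddt -n 40 ./mpi_example<br />
</pre><br />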
<br />
==Setting up a client-server connection==<br />
<br />
If you're working from home, or any other location where there isn't a fast internet connection, it is likely to be advantageous to run DDT or MAP in client-server mode. This keeps the bulk of the computation on Niagara or Mist (the server), while sending only the minimum amount of information over the internet to your locally-running version of DDT (the client).<br />
<br />
===Setting up the server side===<br />
<br />
The first step is to connect to Niagara (or Mist), and start a debug session<br />
<br />
ejspence@nia-login01 $ debugjob -N 1<br />
debugjob: Requesting 1 node(s) with 40 core(s) for 60 minutes and 0 seconds<br />
SALLOC: Granted job allocation 3995470<br />
SALLOC: Waiting for resource configuration<br />
SALLOC: Nodes nia0003 are ready for job<br />
ejspence@nia0003 $<br />
<br />
This will start an interactive debug session, on a single node, for an hour. Be sure to note the node you have been allocated (nia0003 in this case).<br />
<br />
The next step is to determine the path to DDT. To do this you will need to load the DDT module:<br />
<br />
ejspence@nia0003 $ module load NiaEnv/2019b<br />
ejspence@nia0003 $ module load ddt/19.1<br />
ejspence@nia0003 $<br />
ejspence@nia0003 $ echo $SCINET_DDT_ROOT<br />
/scinet/niagara/software/2019b/opt/base/ddt/19.1<br />
ejspence@nia0003 $<br />
<br />
The next step is to create a startup script which will be run by the server, in case you are running on multiple nodes:<br />
<br />
#!/bin/bash<br />
module purge<br />
module load NiaEnv/2019b<br />
module load gcc/8.3.0 openmpi/4.0.1 ddt/19.1<br />
export ARM_TOOLS_CONFIG_DIR=${SCRATCH}/.arm<br />
mkdir -p ${ARM_TOOLS_CONFIG_DIR}<br />
export OMPI_MCA_pml=ob1<br />
<br />
Be sure to load whatever modules your code needs to run. Let us assume that the path to this script is $SCRATCH/ddt_remote_setup.sh.<br />
<br />
This completes the setup of the server side. There is no need to launch the server, the client itself will do this.<br />
<br />
===Setting up the client side===<br />
<br />
You now need to setup the client on your local machine (desktop or laptop). The first step is to go to [https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge/older-versions-of-remote-client-for-arm-forge this page] to download the Arm Forge client. Note that this page is for older versions of DDT. This is because the client and the server must be running the same version of DDT, and the version on Niagara is 19.1. Download the version of the client appropriate for your local machine, and install it.<br />
<br />
Now launch Arm Forge. You will see a screen similar to the one below. Select "Remote Launch", then "Configure".<br />
<br />
{| align="center"<br />
| [[File:DDT openning.png|480px|]]<br />
|}<br />
<br />
You will see that there are no sessions already configured. Click on "Add" to create a new session configuration.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions.png|480px|]]<br />
|}<br />
<br />
Next, fill in the details of the session. You need to fill in <br />
* the name of the session,<br />
* the host name, consisting of<br />
** your login credentials for Niagara (or Mist),<br />
** a space,<br />
** your user name and the node you are using (nia0003 in this example),<br />
* the installation directory of DDT on Niagara,<br />
* the location of your startup script.<br />
<br />
{| align="center"<br />
| [[File:DDT settings.png|540px|]]<br />
|}<br />
<br />
After you've entered the settings, click on "OK". This should bring you to the screen seen below.<br />
<br />
{| align="center"<br />
| [[File:DDT sessions2.png|480px|]]<br />
|}<br />
<br />
The opening screen should now look like the one below.<br />
<br />
{| align="center"<br />
| [[File:DDT openning2.png|480px|]]<br />
|}<br />
<br />
Click on the session you'd like to launch. In this example, "DDT Test". This will bring you to DDT's launch screen, which is the same you'll see when you run DDT normally. Note that the code and files you will be testing must be hosted on Niagara, not on your local machine.</div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Ansys&diff=2945Ansys2021-02-16T14:16:07Z<p>Northrup: </p>
<hr />
<div>The [http://www.ansys.com/ Ansys] engineering simulation tools are installed in both the Niagara and CC software stacks.<br />
<br />
=Getting a license=<br />
Licenses are provided by [http://www.cmc.ca CMC Microsystems]. Canadian students and faculty can register at [https://www.cmc.ca/en/MyAccount/GetAccount.aspx this page].<br />
<br />
Once you have an account, you must contact CMC and tell them you want to use the Ansys tools on Niagara, and give them your SciNet username.<br />
<br />
=Running using the Niagara installation=<br />
<br />
==Ansys 2020R2==<br />
Commercial modules can only be accessed using the 'module use' command.<br />
<br />
module use /scinet/niagara/software/commercial/modules<br />
module load ansys/2020r2<br />
<br />
Programs available:<br />
<br />
* fluent<br />
* ansysedt<br />
* mapdl<br />
* ...<br />
You can use the Ansys graphical tools to set up your problem, but you cannot use the graphical tools to submit your job. The job must be submitted to the scheduler for running.<br />
<br />
==Setting up your .mw directory==<br />
<br />
Ansys will attempt to write to your $HOME/.mw directory. This will work when you are testing your workflow on the login nodes, because they can write to $HOME. However, recall that the compute nodes cannot write to the /home filesystem. If you attempt to run Ansys from a compute node using the default configuration, it will fail because Ansys cannot write to $HOME/.mw.<br />
<br />
The solution is to create an alternative directory called $SCRATCH/.mw, and create a soft link from $HOME/.mw to $SCRATCH/.mw:<br />
mkdir $SCRATCH/.mw<br />
ln -s $SCRATCH/.mw $HOME/.mw<br />
This will fool Ansys into thinking it is writing to $HOME/.mw, when in fact it is writing to $SCRATCH/.mw. These commands only need to be run once.<br />
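You can check that the link is in place before submitting jobs, for example:<br />
 ls -ld $HOME/.mw    # should show a symbolic link pointing to $SCRATCH/.mw<br />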
<br />
==Running ansys202==<br />
<br />
Example submission script for a job running on 1 node, with max walltime of 11 hours:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=11:00:00<br />
#SBATCH --job-name test<br />
<br />
module use /scinet/niagara/software/commercial/modules<br />
module load ansys/2020r2<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
machines=$(srun bash -c 'hostname -s' | sort | uniq | awk '{print $1 ":" 40}' | paste -s -d ':')<br />
ansys202 -b -j JOBNAME -dis -machines "$machines" -i ansys.in<br />
</source><br />
<br />
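Fluent can also be run in batch mode from a similar job script. The line below is only a sketch (the journal file name is hypothetical); check the ANSYS documentation linked below for the exact Fluent launch options on a cluster:<br />
<pre><br />
fluent 3ddp -g -t "$SLURM_NTASKS" -i fluent_input.jou<br />
</pre><br />
<br />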
<br />
More information can be found here: https://docs.computecanada.ca/wiki/ANSYS<br />
<br />
<!--<br />
INPUTFILE=input.jou<br />
fluent 2ddp -t "$PBS_NP" -cnf="$PBS_NODEFILE" -mpi=intel -pib -pcheck -g -i "$INPUTFILE"<br />
<br />
<br />
=Running using the CC installation=<br />
<br />
==Ansys 19.0==<br />
To access the CC software stack you must unload the Niagara stack.<br />
<br />
module load CCEnv StdEnv<br />
module load ansys/19.0<br />
<br />
You can run the script given in the previous section by substituting the previous module commands with the above two.<br />
<br />
More infromation can be found here: https://docs.computecanada.ca/wiki/ANSYS <br />
<br />
<br />
--></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Ansys&diff=2944Ansys2021-02-16T14:15:33Z<p>Northrup: </p>
<hr />
<div>The [http://www.ansys.com/ Ansys] engineering simulation tools are installed in both the Niagara and CC software stacks.<br />
<br />
=Getting a license=<br />
Licenses are provided by [http://www.cmc.ca CMC Microsystems]. Canadian students and faculty can register at [https://www.cmc.ca/en/MyAccount/GetAccount.aspx this page].<br />
<br />
Once you have an account, you must contact CMC and tell them you want to use the Ansys tools on Niagara, and give them your SciNet username.<br />
<br />
=Running using the Niagara installation=<br />
<br />
==Ansys 2020R2==<br />
Commercial modules can only be accessed using the 'module use' command.<br />
<br />
module use /scinet/niagara/software/commercial/modules<br />
module load ansys/2020r2<br />
<br />
Programs available:<br />
<br />
* fluent<br />
* ansysedt<br />
* mapdl<br />
* ...<br />
You can use the Ansys graphical tools to set up your problem, but you cannot use the graphical tools to submit your job. The job must be submitted to the scheduler for running.<br />
<br />
==Setting up your .mw directory==<br />
<br />
Ansys will attempt to write to your $HOME/.mw directory. This will work when you are testing your workflow on the login nodes, because they can write to $HOME. However, recall that the compute nodes cannot write to the /home filesystem. If you attempt to run Ansys from a compute node using the default configuration, it will fail because Ansys cannot write to $HOME/.mw.<br />
<br />
The solution is to create an alternative directory called $SCRATCH/.mw, and create a soft link from $HOME/.mw to $SCRATCH/.mw:<br />
mkdir $SCRATCH/.mw<br />
ln -s $SCRATCH/.mw $HOME/.mw<br />
This will fool Ansys into thinking it is writing to $HOME/.mw, when in fact it is writing to $SCRATCH/.mw. This command only needs to be run once.<br />
<br />
==Running ansys202==<br />
<br />
Example submission script for a job running on 1 node, with max walltime of 11 hours:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --time=11:00:00<br />
#SBATCH --job-name test<br />
<br />
module use /scinet/niagara/software/commercial/modules<br />
module load ansys/2020r2<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
machines=$(srun bash -c 'hostname -s' | sort | uniq | awk '{print $1 ":" 40}' | paste -s -d ':')<br />
ansys202 -b -j JOBNAME -dis -machines "$machines" -i ansys.in<br />
</source><br />
<br />
<!--<br />
INPUTFILE=input.jou<br />
fluent 2ddp -t "$PBS_NP" -cnf="$PBS_NODEFILE" -mpi=intel -pib -pcheck -g -i "$INPUTFILE"<br />
<br />
<br />
=Running using the CC installation=<br />
<br />
==Ansys 19.0==<br />
To access the CC software stack you must unload the Niagara stack.<br />
<br />
module load CCEnv StdEnv<br />
module load ansys/19.0<br />
<br />
You can run the script given in the previous section by substituting the previous module commands with the above two.<br />
<br />
More infromation can be found here: https://docs.computecanada.ca/wiki/ANSYS <br />
<br />
<br />
--></div>Northruphttps://docs.scinet.utoronto.ca/index.php?title=Co-array_Fortran_on_Niagara&diff=2941Co-array Fortran on Niagara2021-02-03T01:37:51Z<p>Northrup: /* Running */</p>
<hr />
<div>Versions 12 and higher of the Intel Fortran compiler, and version 5.1 and up of the GNU Fortran compiler, support almost all of Co-array Fortran, and are installed on [[Niagara Quickstart | Niagara]]. <br />
<br />
This page will briefly sketch how to compile and run Co-array Fortran programs using these compilers.<br />
<br />
==Example==<br />
Here is an example of a co-array fortran program:<br />
<source lang="fortran"><br />
program Hello_World<br />
integer :: i ! Local variable<br />
integer :: num[*] ! scalar coarray<br />
if (this_image() == 1) then<br />
write(*,'(a)') 'Enter a number: '<br />
read(*,'(i80)') num<br />
! Distribute information to other images<br />
do i = 2, num_images()<br />
num[i] = num<br />
end do<br />
end if<br />
sync all ! Barrier to make sure the data has arrived<br />
! I/O from all nodes<br />
write(*,'(a,i0,a,i0)') 'Hello ',num,' from image ', this_image()<br />
end program Hello_world<br />
</source><br />
(Adapted from [http://en.wikipedia.org/wiki/Co-array_Fortran]).<br />
<br />
Compiling, linking, and running Co-array Fortran programs differs depending on whether you will run the program only on a single node (with 40 cores on Niagara) or on several nodes, and on which compiler you are using, Intel or GNU.<br />
<br />
==Intel compiler instructions for Coarray Fortran==<br />
<br />
===Loading necessary modules===<br />
First, you need to load the module for version 12 or greater of the Intel compilers, as well as Intel MPI.<br />
<pre><br />
module load NiaEnv/2019b intel/2019u4 intelmpi/2019u4<br />
</pre><br />
<br />
There are two modes in which the Intel compiler supports Coarray Fortran:<br />
<br />
1. Single node usage<br />
<br />
2. Multiple node usage<br />
<br />
The way you compile and run for these two cases is different.<br />
<br />
<!-- <br />
However, we're working on making coarray fortran compilation and running more uniform among these two cases, as well as with the, as-yet-experimental, gfortran coarray support. See [[#Uniformized_Usage | Uniformized Usage]] below.<br />
--><br />
<br />
Note: For multiple-node usage, it makes sense to load the IntelMPI module, since Intel's implementation of Co-array Fortran uses MPI. However, the Intel MPI module is needed even for single-node usage, just in order to link successfully.<br />
<br />
===Single node usage===<br />
<br />
====Compilation====<br />
<source lang="bash"><br />
ifort -O3 -xHost -coarray=shared -c [sourcefile] -o [objectfile]<br />
</source><br />
<br />
====Linking====<br />
<source lang="bash"><br />
ifort -coarray=shared [objectfile] -o [executable]<br />
</source><br />
<br />
====Running====<br />
To run this co-array program on one node with 80 images (an image is the co-array counterpart of what OpenMP calls a thread and MPI calls a process), you simply put<br />
<source lang="bash"><br />
./[executable]<br />
</source><br />
in your job submission script. The reason that this gives 80 images is that [[Slurm#Hyperthreading:_Logical_CPUs_vs._cores | HyperThreading]] is enabled on Niagara nodes, which makes it seem to the system as if there are 80 computing units on a node, even though physically there are only 40.<br />
<br />
To control the number of images, you can change the <tt>FOR_COARRAY_NUM_IMAGES</tt> environment variable:<br />
<source lang="bash"><br />
export FOR_COARRAY_NUM_IMAGES=2<br />
./[executable]<br />
</source><br />
This can be useful for testing.<br />
<br />
An example submission script would look as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
# SLURM submission script for SciNet Niagara (Intel Coarray Fortran)<br />
#<br />
#SBATCH --nodes=1<br />
#SBATCH --time=1:00:00<br />
#SBATCH --cpus-per-task=40<br />
#SBATCH --job-name test<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# LOAD MODULES THAT THE APPLICATION WAS COMPILED WITH<br />
module load NiaEnv/2019b intel/2019u4 intelmpi/2019u4<br />
<br />
# RUN THE APPLICATION WITH 80 IMAGES<br />
export FOR_COARRAY_NUM_IMAGES=80<br />
./[executable]<br />
</source><br />
<br />
<br />
===Multiple nodes usage===<br />
<br />
Please read over the following link for the newer Intel compilers: [https://software.intel.com/content/www/us/en/develop/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential.html]<br />
<br />
<source lang="bash"><br />
module load NiaEnv/2019b intel/2019u4 intelmpi/2019u4<br />
</source><br />
<br />
====Compilation====<br />
<source lang="bash"><br />
ifort -O3 -xHost -coarray=distributed -c [sourcefile] -o [objectfile]<br />
</source><br />
<br />
====Linking====<br />
<source lang="bash"><br />
ifort -coarray=distributed [objectfile] -o [executable]<br />
</source><br />
<br />
====Running====<br />
Because distributed Co-array Fortran is based on MPI, we need to launch the MPI processes on different nodes.<br />
<br />
<!-- <br />
The defaults will work on Niagara, however the number of images will be equal to the number of nodes * cpus-per-tasks<br />
--><br />
<br />
An example submission script would look as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
#<br />
#SBATCH --nodes=4<br />
#SBATCH --time=1:00:00<br />
#SBATCH --ntasks-per-node=40<br />
#SBATCH --job-name test<br />
<br />
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from<br />
cd $SLURM_SUBMIT_DIR<br />
<br />
# LOAD MODULES THAT THE APPLICATION WAS COMPILED WITH<br />
module load NiaEnv/2019b intel/2019u4 intelmpi/2019u4<br />
<br />
export FOR_COARRAY_NUM_IMAGES=$SLURM_NTASKS<br />
<br />
# EXECUTION COMMAND; FOR_COARRAY_NUM_IMAGES = nodes * ntasks-per-node (set by the export above)<br />
./[executable]<br />
</source><br />
<br />
You can provide a configuration file using the ifort '-coarray-config-file=file.cfg' option, which allows you to provide<br />
your own MPI parameters, including the number of tasks per host and the total number of tasks, i.e., images.<br />
<br />
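As a minimal sketch (the launcher options and file names are illustrative and must match your Slurm request; consult the Intel documentation linked above for the options supported by your Intel MPI version), such a file could be created and used as follows:<br />
<source lang="bash"><br />
cat > file.cfg <<EOF<br />
-genvall -n 160 -ppn 40 ./my_caf_program<br />
EOF<br />
<br />
ifort -O3 -xHost -coarray=distributed -coarray-config-file=file.cfg my_caf_program.f90 -o my_caf_program<br />
</source><br />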
<br />
<!--<br />
<br />
===Features known not to workin earlier versions===<br />
<br />
There are a few features that are known not to work in the current version of the Intel Fortran compiler (v12.0), such as character coarrays. See section 3.2.3.3 of the [http://software.intel.com/en-us/articles/intel-fortran-composer-xe-2011-release-notes/ official release notes]. These issues may get fixed in later releases.<br />
<br />
===Uniformized Usage===<br />
<br />
If you load the addition module<br />
<pre><br />
module load caf/intel/any<br />
</pre><br />
you get access to a compilation and linking wrapper called <tt>caf</tt> and a wrapper for running the application called <tt>cafrun</tt>.<br />
<br />
====Compilation====<br />
<source lang="bash"><br />
caf -O3 -xhost -c [sourcefile] -o [objectfile]<br />
</source><br />
<br />
====Linking====<br />
<source lang="bash"><br />
caf [objectfile] -o [executable]<br />
</source><br />
<br />
====Running====<br />
To run this co-array program on one node with 40 images, you simply put<br />
<source lang="bash"><br />
cafrun ./[executable]<br />
</source><br />
This runs 40 images, not 80.<br />
<br />
To control the number of images, you can change the run command to<br />
<source lang="bash"><br />
cafrun -np 2 ./[executable]<br />
</source><br />
This can be useful for testing.<br />
<br />
To control the number of images per node, add the <tt>-N [images-per-node]</tt> option.<br />
<br />
Note: currently, the uniformized mode doesn't explicitly utilize optimization opportunities offered by the single node mode, although it will work on one node.<br />
<br />
<br />
<br />
==GNU compiler instructions for Coarray Fortran==<br />
<br />
Coarray Fortran is supported in the GNU compiler suite (GCC) starting from version 5.1. To implement coarrays, it uses the <tt>opencoarray</tt> library, which in turn uses OpenMPI (or at least, that's how it has been set up on the GPC).<br />
<br />
''Issues seem to exist with the gcc/opencoarray Fortran compilers, particularly with multidimensional arrays. We're still investigating the cause, but for now, the coarray Fortran support in gcc should be considered experimental.''<br />
<br />
===Loading necessary modules===<br />
First, you need to load the module for version 5.2 or greater of the GNU compilers (version 5.1 would've worked, but we skipped that release on the GPC), as well as OpenMPI.<br />
<pre><br />
module load gcc/5.2.0 openmpi/gcc/1.8.3 use.experimental caf/gcc/5.2.0-openmpi<br />
</pre><br />
<br />
The caf/gcc/5.2.0-openmpi module comes with a compilation and linking wrapper called <tt>caf</tt> and a wrapper for running the application called <tt>cafrun</tt>.<br />
<br />
===Compilation===<br />
<source lang="bash"><br />
caf -O3 -march=native -c [sourcefile] -o [objectfile]<br />
</source><br />
<br />
===Linking===<br />
<source lang="bash"><br />
caf [objectfile] -o [executable]<br />
</source><br />
<br />
===Running===<br />
To run this co-array program on one node with 8 images (co-array version for what openmp calls a thread and mpi calls a process), you simply put<br />
<source lang="bash"><br />
cafrun ./[executable]<br />
</source><br />
in your job submission script. In contrast with the Intel compiler, this does not run 16 images, but only 8. The reason is that the gcc/opencoarray implementation uses MPI, and MPI is not aware of [[GPC_Quickstart#HyperThreading | HyperThreading]].<br />
<br />
To control the number of images, you can change the run command to<br />
<source lang="bash"><br />
cafrun -np 2 ./[executable]<br />
</source><br />
This can be useful for testing, or to exploit HyperThreading.<br />
<br />
An example submission script would look as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
# MOAB/Torque submission script for SciNet GPC (GCC Coarray Fortran)<br />
#<br />
#PBS -l nodes=1:ppn=8,walltime=1:00:00<br />
#PBS -N test<br />
<br />
# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from<br />
cd $PBS_O_WORKDIR<br />
<br />
# LOAD MODULES THAT THE APPLICATION WAS COMPILED WITH<br />
module load gcc/5.2.0 openmpi/gcc/1.8.3<br />
<br />
# RUN WITH 16 IMAGES ON 1 NODE<br />
cafrun -np 16 ./[executable]<br />
</source><br />
<br />
===Multiple nodes usage===<br />
<br />
Because the GNU implementation of Coarray Fortran in the gcc/5.2.0 module is based on MPI, running on multiple nodes is no different from the single-node usage.<br />
An example multi-node submission script would look as follows:<br />
<br />
<source lang="bash"><br />
#!/bin/bash<br />
# MOAB/Torque submission script for SciNet GPC (GCC Coarray Fortran on multiple nodes)<br />
#<br />
#PBS -l nodes=4:ppn=8,walltime=1:00:00<br />
#PBS -N test<br />
<br />
# DIRECTORY TO RUN - $PBS_O_WORKDIR is directory job was submitted from<br />
cd $PBS_O_WORKDIR<br />
<br />
# LOAD MODULES THAT THE APPLICATION WAS COMPILED WITH<br />
module load gcc/5.2.0 openmpi/gcc/1.8.3<br />
<br />
# EXECUTION with 32 images (nodes*ppn)<br />
cafrun -np 32 ./[executable]<br />
</source><br />
<br />
--></div>Northrup