Previous messages

From SciNet Users Documentation
Revision as of 13:16, 4 July 2023 by Rzon (talk | contribs)

Wed Jun 21 16:03:45 EDT 2023: Niagara's scheduler maintenance is finished.

Wed Jun 21 15:42:00 EDT 2023: Niagara's scheduler is rebooting in 10 minutes for a short maintenance down time.

Wed Jun 21, 2023, 11:25 AM EDT: Maintenance is finished and Teach cluster is accessible again.

Tue Jun 20, 2023, 9:55 AM EDT: Teach cluster is powered off for maintenance.

Tue June 20, 2023: Announcement:
The Teach cluster at SciNet will undergo a maintenance shutdown starting on Tuesday June 20, 2023. It will likely take a few days before it will be available again. Check here for updates.

Mon Jun 5, 2023, 2:35 PM EDT: All systems are operational again.

Mon Jun 5, 2023, 11:55 AM EDT: There were issues with the cooling system. The login nodes and file systems are now accessible again, but compute nodes are still off.

Mon Jun 5, 2023, 6:55 AM EDT: Issues at the data center, we are investigating.

Sat May 27, 2023, 21:00 EDT: We have been able to mitigate the UPS issue for now, until new parts arrive sometime during the week. Systems will be accessible soon.

Sat May 27, 2023, 16:00 EDT: We identified a UPS/power-related issue at the datacenter that is adversely affecting several components, in particular all file systems. Out of an abundance of caution, we are shutting down the cluster until the UPS situation is resolved. Ongoing jobs will be canceled.

Sat May 27, 2023, 11:18AM EDT: Filesystem issues, investigating.

Wed May 24, 2023, 11:40AM EDT: Mist login node is accessible again.

Wed May 24, 2023, 11:10 AM EDT: Mist login node is under maintenance and temporarily inaccessible to users.

Mon May 15, 2023, 10:08 AM EDT: Rebooting Mist login node again.

Mon May 15, 2023, 09:15 AM EDT: Rebooting Mist login node.

Mon May 01, 2023, 04:00 PM EDT: Done rebooting nia-login nodes.

Mon May 01, 2023, 12:00 PM EDT: Rebooting all nia-login nodes one at a time.

Mon May 01, 2023, 11:00 AM EDT: nia-login07 is going to be rebooted.

Thu Apr 20, 2023, 12:05 PM EDT: Mist login node is accessible again.

Thu Apr 20, 2023, 11:30 AM EDT: Mist login node is under maintenance and temporarily inaccessible to users.

Thu Apr 20, 2023, 8:27 AM EDT: Intermittent file system issues. We are investigating. For now (10:45 AM), the file systems appear operational.

Fri 14 Apr 2023 10:25 AM EDT: Switch problem resolved.

Fri 14 Apr 2023 10:10 AM EDT: A switch problem is affecting access to certain equipment at the SciNet data center, including the Teach cluster. Niagara and Mist are accessible.

Fri 14 Apr 2023 09:55 AM EDT: SciNet Jupyter Hub maintenance is finished and it is again available for users.

Fri 14 Apr 2023: SciNet Jupyter Hub will be restarted for system updates this morning. Remember to save your notebooks!

Thu 06 Apr 2023 03:40 PM EDT: Rouge cluster is accessible again.

Thu 06 Apr 2023 01:00 PM EDT: Rouge cluster is temporarily inaccessible to users due to the electrical work.

Sun 02 Apr 2023 03:37 AM EDT: IO/read errors on the file system seem to have been fixed. Please resubmit your jobs, and report any further problems to support. Burst Buffer will remain offline for now.

Sun 02 Apr 2023 12:18 AM EDT: File System is back up, but there seem to be some IO/read errors. All running jobs have been killed. Please hold off on submitting jobs until further notice.

Sat 01 Apr 2023 10:17 PM EDT: We are having issues with the File System. Currently investigating the cause.

Fri 31 Mar 2023 11:00 PM EDT: Burst Buffer may be the culprit. We are investigating but may have to take Burst Buffer offline.

Fri 31 Mar 2023 01:30 PM EDT: File system issues causing trouble for some jobs on Niagara and Mist


Tue 28 Mar 2023 11:05 AM EDT: Mist login node is accessible again.

Tue 28 Mar 2023 10:35 AM EDT: Mist login node is under maintenance and temporarily inaccessible to users.

Fri 17 Mar 2023 2:50 PM EDT: All systems online.

Fri 17 Mar 2023 11:00 AM EDT: Problem identified and repaired. Starting to bring up systems, but not available to users yet.

Fri 17 Mar 2023 09:15:39 EDT: Staff on site and ticket opened with cooling contractor, cause of failure unclear

Fri 17 Mar 2023 01:47:43 EDT: Cooling system malfunction, datacentre is shut down.

Tue Feb 28, 16:40 EST: All systems are back online.

Tue Feb 28, 15:30 EST: Maintenance is complete. Bringing up systems.

Tue Feb 28, 7:10 AM EST: Maintenance shutdown resuming.

Mon Feb 27, 3:55 PM EST: Maintenance paused as parts were delayed. The maintenance will resume tomorrow (Tue Feb 28) at 7AM EST for about 5 hours. In the meantime, the login nodes of the systems will be brought online.

Mon Feb 27, 7:20 AM EST: Maintenance shutdown started.

February 27 and 28, 2023: SciNet Data Centre Maintenance:
This annual winter maintenance involves a full data centre shutdown starting at 7:00 a.m. EST on Monday, February 27. None of the SciNet systems (Niagara, Mist, Rouge, Teach, the file systems, as well as hosted equipment) will be accessible.

On the second day of the maintenance, Niagara, Mist, and their file systems are expected to become partially available for users. All systems should be fully available in the evening of the 28th.

The scheduler will hold jobs that cannot finish before the start of the shutdown. Users are encouraged to submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the maintenance on otherwise idle nodes.
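As a rough illustration of the rule above, the following sketch checks whether a job with a given requested time limit would end before the shutdown begins. It assumes GNU date (for the -d option) and that the scheduler judges a job by its requested time limit. The function name fits_before_shutdown and the example submission time are hypothetical, not part of any SciNet tooling.

```shell
# Succeeds (exit status 0) if a job starting at "$1" with a time limit of
# "$3" minutes would end no later than the shutdown start "$2".
# Assumes GNU date, which accepts -d "<timestamp>".
fits_before_shutdown() {
    job_start=$(date -d "$1" +%s) || return 2
    shutdown_start=$(date -d "$2" +%s) || return 2
    job_end=$(( job_start + $3 * 60 ))
    [ "$job_end" -le "$shutdown_start" ]
}

# Example: the shutdown begins Feb 27, 2023 at 7:00 a.m. EST (UTC-05:00);
# the start time of 8:00 p.m. the evening before is a made-up value.
if fits_before_shutdown "2023-02-26 20:00 -0500" "2023-02-27 07:00 -0500" 600; then
    echo "a 10-hour job starting now could still run"
else
    echo "a 10-hour job would be held until after the maintenance"
fi
```

In practice a job also has to wait for free nodes, so a shorter requested time limit leaves more room for the scheduler to fit it in.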

Feb 17, 2023, 11:15 PM EST: File system issues on Teach fixed and Teach is accessible again. Note that the file system of Teach is not very good at handling many remote vscode connections.

Feb 17, 2023, 11:02 PM EST: File system issues on Teach. We are working on a fix.

Sun Feb 12, 2023, 3:05 PM EST: All systems are back online.

Sun Feb 12, 2023, 2:10 PM EST: Power restored, clusters are being started.

Sat Feb 11, 2023, 2:35 PM EST: Power interruption started. All compute nodes will be down, likely until Sunday afternoon.

Sat Feb 11, 2023, 1:20 PM EST: There is to be an emergency power repair on the adjacent street. The datacentre will be switching over to generator. All compute nodes will be down.

Fri Feb 10, 2023, 10:55 AM EST: All systems are back online.

Fri Feb 10, 2023, 10:00 AM EST: Cooling issue resolved, cluster is being started.

Wed Jan 25, 2023, 02:15 PM EST: Mist login node is accessible again.

Wed Jan 25, 2023, 10:30 AM EST: Mist login node is under maintenance and temporarily inaccessible to users.

Mon Jan 23, 2023, around 7-8 AM EST: Intermittent file system issues may have killed your job. Users are advised to resubmit.

Sat Jan 21, 2023, 00:50 EST: Niagara, Mist, Rouge and the filesystems are up.

Fri Jan 20, 2023, 11:19 PM EST: Systems are coming up. We have determined that there was a general power glitch in the area of our datacentre. The power has been fully restored.

Fri Jan 20, 2023, 10:34 PM EST: Cooling is back. Systems are slowly coming up.

Fri Jan 20, 2023, 8:20 PM EST: A cooling failure at the data center, possibly due to a power glitch. We are investigating.

Thu Jan 12, 2023, 9:30 AM EST: File system is experiencing issues. Issues have stabilized, but jobs running around this time may have been affected.

Wed Dec 21, 2022, 12:00 PM: Please note that SciNet is on vacation, together with the University of Toronto. Full service will resume on Jan 2, 2023. We will endeavour to keep systems running, and answer tickets, on a best-effort basis. Happy Holidays!!!

Fri Dec 16, 2022, 2:19 PM: City power glitch caused all compute nodes to reboot. Please resubmit your jobs.

Mon Dec 12, 2022, 9:30 AM - 11:30 AM: File system issues caused login issues and may have affected running jobs. System back to normal now, but users may want to check any jobs they had running.

Wed Dec 7, 2022, 11:40 AM EST: Systems are being brought back online.

Wed Dec 7, 2022, 09:00 AM EST: Maintenance is underway.

Announcement:

On Wednesday December 7th, 2022, the file systems of the SciNet systems (Niagara, Mist, HPSS, and the Teach cluster) will undergo maintenance starting at 9:00 am EST. During the maintenance, there will be no access to any of these systems, as it requires all file system operations to have stopped. The maintenance should take about 1 hour, and all systems are expected to become available again later that morning.


Wed Nov 30, 2022, 2:45 PM EST: Mist login node is accessible again.

Wed Nov 30, 2022, 2:15 PM EST: Mist login node is under maintenance and temporarily inaccessible to users.

Thu Oct 20, 2022, 6:00 PM EDT: Systems are back online.

Thu Oct 20, 2022, 09:40 AM EDT: About half of Niagara compute nodes are up. Note that only jobs that can finish by 5:00 PM will run.

Thu Oct 20, 2022, 07:50 AM EDT: Jupyter Hub is available again.

Thu Oct 20, 2022, 07:35 AM EDT: Jupyter Hub is being updated and temporarily inaccessible to users.

Thu Oct 20, 2022, 07:30 AM EDT: Maintenance is underway.

Announcement:

On Thursday October 20th, 2022, the SciNet datacentre (which hosts Niagara and Mist) will undergo transformer maintenance from 7:30 am EDT to 5:00 pm EDT. At both the start and end of this maintenance window, all systems will need to be briefly shut down and will not be accessible. Apart from that, during this window, login nodes will be accessible and part of Niagara will be available to run jobs. The Mist and Rouge clusters will be off for the entirety of this maintenance.

Users are encouraged to submit Niagara jobs of about 1 to 2 hours in the days before the maintenance, as these could be run within the window between 8 AM and 5 PM EDT.


Wed Oct 5, 2022, 12:10 PM EDT: A grid power glitch caused all compute nodes to reboot. Please resubmit your jobs.

Mon Oct 3, 2022, 11:20 PM EDT: Niagara login nodes are accessible from outside again.

Mon Oct 3, 2022, 9:20 PM EDT: Niagara login nodes are inaccessible from outside of the datacentre at the moment. As a work-around, ssh into mist.scinet.utoronto.ca and then ssh into e.g. nia-login01.
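The two-step work-around above can also be written as a single command with the OpenSSH client's -J (ProxyJump) option, available since OpenSSH 7.3. The sketch below only builds and prints the command; the username myuser is a placeholder, and it assumes the same account name is used on both Mist and Niagara.

```shell
# Hypothetical username; substitute your own SciNet account name.
SCINET_USER="myuser"

# One-step equivalent of "ssh to Mist, then ssh to nia-login01":
# -J makes the connection hop through the Mist login node first.
NIA_CMD="ssh -J ${SCINET_USER}@mist.scinet.utoronto.ca ${SCINET_USER}@nia-login01"

# Print the command for inspection; copy and run it once it looks right.
echo "$NIA_CMD"
```

With -J, the short name nia-login01 is looked up from the Mist side of the connection, which is why it does not need to be reachable from outside the datacentre.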

Wed Sep 28, 2022, 1:15 PM EDT: The JupyterHub maintenance is finished and it is now accessible to users.

Wed Sep 28, 2022, 1:00 PM EDT: The JupyterHub is to be rebooted for system upgrades. Running processes and notebooks will be closed. The service is expected to be back around 1:30 PM EDT.

Tue Sep 27, 2022, 11:50 AM EDT: Mist login node is accessible again.

Tue Sep 27, 2022, 11:25 AM EDT: Mist login node is under maintenance and temporarily inaccessible to users.

Mon Sep 26, 2022, 11:35 AM EDT: Rouge and Teach login nodes are accessible again.

Mon Sep 26, 2022, 11:05 AM EDT: Rouge and Teach login nodes are under maintenance and temporarily inaccessible to users.

Fri Sep 23, 2022, 12:46 AM EDT: The CCEnv software stack is back to normal.

Thu Sep 22, 2022, 8:15 PM EDT: The CCEnv software stack is inaccessible due to an issue with CVMFS.

Tue Sep 20, 2022, 4:00 PM EDT: Rouge login node is accessible again.

Tue Sep 20, 2022, 10:20 AM EDT: Rouge login node is under maintenance and temporarily inaccessible to users (hardware upgrade).

Tue Sep 20, 2022, 9:41 AM EDT: Rouge login node is back up.

Tue Sep 20, 2022, 8:25 AM EDT: Rouge login node down, we are investigating.

Fri Sept 16, 2022, 9:30 AM EDT: Login nodes are accessible again.

Fri Sept 16, 2022, 9:00 AM EDT: Login nodes are not accessible. We are investigating.

Tue Sep 13, 2022, 11:00 AM EDT: Mist login node is available again.

Tue Sep 13, 2022, 10:00 AM EDT: Mist login node is under maintenance and temporarily inaccessible to users.

Fri Sep 2, 2022, 11:25 AM EDT: Rouge login node is back up.

Fri Sep 2, 2022, 10:25 AM EDT: Issues with the Rouge login node; we are investigating.

Tue Aug 23, 2022, 1:15 PM EDT: Jupyter Hub is available again.

Tue Aug 23, 2022, 1:00 PM EDT: Jupyter Hub is being updated and temporarily inaccessible to users.

Fri Aug 12, 2022, 6:30 PM EDT: File system issues are resolved.

Fri Aug 12, 2022, 5:06 PM EDT: File system issues. We are investigating.

Thu Aug 11, 2022, 9:20 AM EDT: The login node issues have been resolved.

Thu Aug 11, 2022, 7:50 AM EDT: We are having problems accessing the Niagara login nodes. Until fixed, please login to Mist and then ssh to a Niagara login node to access Niagara ("ssh nia-login02", for example).

Fri July 15, 2022, 10:50 AM EDT: Jupyter Hub is available again.

Fri July 15, 2022, 10:30 AM EDT: Jupyter Hub is being updated and temporarily inaccessible to users.

Thu June 16, 2022, 3:45 PM EDT: File system is stable now. We're gradually opening the systems up.

Thu June 16, 2022, 10:15 AM EDT: Emergency maintenance shutdown of filesystem. Running jobs will be affected.

Wed June 15, 2022, 7:35 PM EDT: Maintenance shutdown finished. Most systems are available again.

Wed June 15, 2022, 7:00 AM EDT: Maintenance shutdown of the SciNet datacentre. There will be no access to any of the SciNet systems during this time. We expect to be able to bring the systems back online in the evening of June 15th.

Mon June 13, 2022, 7:00 AM EDT - Wed June 15, 2022, 7:00 AM EDT: Two-day reservation for the "Niagara at Scale" event. Only "Niagara at Scale" projects will run on the compute nodes (as well as SOSCIP projects, on a subset of nodes). Users are encouraged to submit small and short jobs that could run before this event. Throughout the event, users can still login, access their data, and submit jobs, but these jobs will not run until after the subsequent maintenance (see below). Note that the debugjob queue will remain available to everyone as well.

Mon May 30th, 2022, 12:42:00 EDT: Mist login node is available again.

Mon May 30th, 2022, 10:22:00 EDT: Mist login node is being upgraded and temporarily inaccessible to users.

Wed May 25th, 2022, 13:30:00 EDT: Niagara operating at 100% again.

Tue May 24th, 2022, 21:30:00 EDT: Jupyter Hub up. Part of Niagara can run compute jobs.

Tue May 24th, 2022, 19:00:00 EDT: Systems are up. Users can login, BUT cannot submit jobs yet.

Tue May 24th, 2022, 10:00:00 EDT: We are still performing system checks.

Mon May 23rd, 2022, 16:44:30 EDT: Systems still down. Filesystems are working, but there are quite a number of drive failures - no data loss - so out of an abundance of caution we are keeping the systems down at least until tomorrow. The long weekend has also been disruptive for service response, and we prefer to err on the safe side.

Mon May 23rd, 2022, 08:12:14 EDT: Systems still down. Filesystems being checked to ensure no heat damage.

Sun May 22nd, 2022, 10:16 am EDT: Electrician dispatched to replace blown fuses.

Sun May 22nd, 2022, 2:54 am EDT: Automatic shutdown due to power/cooling.

Fri May 6th, 2022, 11:35 am EDT: HPSS scheduler upgrade also finished.

Thu May 5th, 2022, 7:45 pm EDT: Upgrade of the scheduler has finished, with the exception of HPSS.

Thu May 5th, 2022, 7:00 am - 3:00 pm EDT (approx): Starting from 7:00 am EDT, an upgrade of the scheduler of the Niagara, Mist, and Rouge clusters will be applied. This requires the scheduler to be down for about 5-6 hours, and all compute and login nodes to be rebooted. Jobs cannot be submitted during this maintenance, but jobs submitted beforehand will remain in the queue. For most of the time, the login nodes of the clusters will be available so that users may access their files on the home, scratch, and project file systems.

Monday May 2nd, 2022, 9:30 - 11:00 am EDT: the Niagara login nodes, the jupyter hub, and nia-datamover2 will get rebooted for updates. In the process, any login sessions will get disconnected, and servers on the jupyterhub will stop. Jobs in the Niagara queue will not be affected.

Tue Apr 26, 11:20 AM EDT: A rolling update of the Mist cluster is taking a bit longer than expected, affecting logins to Mist.

Announcement: On Thursday April 14th, 2022, the connectivity to the SciNet datacentre will be disrupted at 11:00 AM EDT for a few minutes, in order to deploy a new network core switch. Any SSH connections or data transfers to SciNet systems (Niagara, Mist, etc.) may be terminated at that time.

Thu March 24, 6:54 AM EST: HPSS is back online

Thu March 24, 8:15 AM EST: HPSS has a hardware problem

Wed March 2, 4:50 PM EST: The CCEnv software stack is available again on Niagara.

Wed March 2, 7:50 AM EST: The CCEnv software stack on Niagara has issues; we are investigating.

Sat Feb 12 2022, 12:59 EST: Jupyterhub is back up, but may have a hardware issue.

Sat Feb 12 2022, 10:36 EST: Issue with the Jupyterhub, since last night. We're investigating.

Tue Feb 1 2022 19:20 EST: Maintenance finished successfully. Systems are up.

Tue Feb 1 2022 13:00 EST: Maintenance downtime started.

Mon Jan 31 2022 13:15:00 EST: The SciNet datacentre's cooling system needs an emergency repair as soon as possible. During this repair, all systems hosted at SciNet (Niagara, Mist, Rouge, HPSS, and Teach) will need to be switched off and will be unavailable to users. Repairs will start Tuesday February 1st, at 1:00 pm EST, and could take until the end of the next day. Please check here for updates.

Sat Jan 29 2022 16:45:38 EST: Fibre repaired.

Sat 29 Jan 2022 11:22:27 EST: Fibre repair is underway. Expect to have connectivity restored later today.

Fri 28 Jan 2022 07:35:01 EST: The fibre optics cable that connects the SciNet datacentre was severed by uncoordinated digging at York University. We expect repairs to happen as soon as possible.

Thu Jan 27 12:46 PM EST 2022: Network issues to and from the datacentre. We are investigating.

Sun Jan 23 11:05 AM EST 2022: Filesystem issues appear to have resolved.

Sun Jan 23 10:30 AM EST 2022: Filesystem issues -- investigating.

Sat Jan 8 11:42 AM EST 2022: The emergency maintenance is complete. Systems are up and available.

Fri Jan 7 2:34 PM EST 2022: The SciNet shutdown is in progress. Systems are expected back on Saturday, Jan 8.

Emergency shutdown Friday January 7, 2022: An emergency shutdown of all SciNet to replace a crucial file system component is planned to take place on Friday January 7, 2022, starting at 8am EST, and will require at least 12 hours of downtime. Updates will be posted during the day.

Thu Jan 6 08:20 AM EST 2022: The SciNet filesystem is having issues. We are investigating.

Fri Dec 24 1:31 PM EST 2021: Please note the following scheduled network maintenance, which will result in loss of connectivity to the SciNet datacentre. Start time: Dec 29, 00:30 EST. Estimated duration: 4 hours and 30 minutes.

Mon Dec 20 4:29 PM EST 2021: Filesystem is back to normal.

Mon Dec 20 2:53 PM EST 2021: Filesystem problem - We are investigating.

Thu Sep 23 12:30 EDT 2021: Cooling restored. Systems should be available later this afternoon.

Thu Sep 23 9:30 EDT 2021: Technicians on site working on cooling system.

Thu Sep 23 3:30 EDT 2021: Cooling system issues still unresolved.

Wed Sep 22 23:27:48 EDT 2021: Shutdown of the datacenter due to a problem with the cooling system.

Wed Sep 22 09:30 EDT 2021: File system issues, resolved.

Wed Sep 22 07:30 EDT 2021: File system issues, investigating.

Sun Sep 19 10:00 EDT 2021: Power glitch interrupted all compute jobs; please resubmit any jobs you had running.

Wed Sep 15 17:35 EDT 2021: filesystem issues resolved

Wed Sep 15 16:39 EDT 2021: filesystem issues

Mon Sep 13 13:15:07 EDT 2021: HPSS is back online.

Fri Sep 10 17:57:23 EDT 2021: HPSS is offline due to unscheduled maintenance.

Wed Aug 18 16:13:42 EDT 2021: The HPSS upgrade is complete.

HPSS Downtime August 17th and 18th, 2021 (Tuesday and Wednesday): We'll be upgrading the HPSS software to version 8.3, along with all the clients (htar/hsi, vfs and Globus/dsi)

July 24, 2021, 6:00 PM EDT: There appear to be file system issues, which may affect users' ability to login. We are investigating.

July 23rd, 2021, 9:00 AM EDT: Security update: Due to a severe vulnerability in the Linux kernel (CVE-2021-33909), our team is currently patching and rebooting all login nodes and compute nodes, as well as the JupyterHub. There should be no effect on running jobs; however, sessions on login and datamover nodes will be disrupted.

July 20th, 2021, 7:00 PM EDT: SLURM configuration - Changed the default behaviour to kill a job step if any task exits with a non-zero exit code. If your code is able to handle failures gracefully, please add srun's option --no-kill to recover the previous default behaviour.

July 20th, 2021, 7:00 PM EDT: Maintenance finished, systems are back online.

SciNet Downtime July 20th, 2021 (Tuesday): There will be a maintenance shutdown of the SciNet data center on Tuesday July 20th, starting at 7 am EDT. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back online in the evening of July 20th. The status of the Niagara cluster can be checked on status.computecanada.ca. For up-to-date and more detailed information on the status of all the SciNet systems, you can always check back here.

May 27, 2021: Datamovers addresses have changed to improve high bandwidth connectivity and cybersecurity. The new addresses are 142.1.174.227 for nia-datamover1.scinet.utoronto.ca, and 142.1.174.228 for nia-datamover2.scinet.utoronto.ca.

If you have jobs that need to connect to a software license server using an ssh tunnel through nia-gw (which actually resolves to datamover1 or datamover2), you may need to ask the system administrators of that license server to allow incoming connections from the new addresses above.

June 29th, 2021, 2:00 PM: Thunderstorm-related power fluctuations are causing some Niagara compute nodes and their jobs to crash. Please resubmit if your jobs seem to have crashed for no apparent reason.

June 28th, 2021, 4:06 PM: Mist OS upgrade is complete.

June 28th, 2021, 9:00 AM: Mist is under maintenance. OS upgrading from RHEL 7 to 8.

June 11th, 2021, 8:30 AM: Maintenance complete. Systems are up.

June 9th to 10th, 2021: The SciNet datacentre will have a scheduled maintenance shutdown. Niagara, Mist, Rouge, HPSS, login nodes, the file systems, and hosted systems will all be offline during the shutdown starting at 7AM EDT on Wednesday June 9th. We expect the systems to be back up in the morning of Friday June 11th. Check here for updates.

May 27th, 20:00. All systems are up and running

May 27th, 19:30. Most systems are up

May 27th, 19:00: Cooling is back. Powering up systems

May 27th, 2021, 11:30am: The cooling tower issue has been identified as a wiring issue and is being repaired. We don't have an ETA on when cooling will be restored, however we are hopeful it will be by the end of the day.

May 27th, 2021, 12:30am: Cooling tower motor is not working properly and may need to be replaced. It's the primary motor and the cooling system cannot run without it, so at least until tomorrow all equipment at the datacenter will remain unavailable. Updates about expected repair times will be posted when they are known.

May 26th, 2021, 9:20pm: We are currently experiencing cooling issues at the SciNet data centre. Updates will be posted as we determine the cause of the problem.

From Tue Mar 30 at 12 noon EST to Thu Apr 1 at 12 noon EST, there will be a two-day reservation for the "Niagara at Scale" pilot event. During these 48 hours, only "Niagara at Scale" projects will run on the compute nodes (as well as SOSCIP projects, on a subset of nodes). All other users can still login, access their data, and submit jobs throughout this event, but the jobs will not run until after the event. The debugjob queue will remain available to everyone as well.

The scheduler will not start batch jobs that cannot finish before the start of this event. Users can submit small and short jobs that can take advantage of this, as the scheduler may be able to fit these jobs in before the event starts on the otherwise idle nodes.

Tue 23 Mar 2021 12:19:07 PM EDT - Planned external network maintenance 12pm-1pm Tuesday, March 23rd.

Thu Jan 28 17:35:16 EST 2021: HPSS services are back online

Thu Jan 28 12:36:21 EST 2021: HPSS services offline

We need a short maintenance window as early as possible this afternoon to make a small configuration change. Ongoing jobs will be allowed to finish, but new submissions will be kept on hold in the queue.

Mon Jan 25 13:16:33 EST 2021: HPSS services are back online

Sat Jan 23 10:03:33 EST 2021: HPSS services offline

We detected some type of hardware failure on our HPSS equipment overnight, so access has been disabled pending further investigation.

Fri Jan 22 10:49:29 EST 2021: The Globus transition to oauth is finished

Please deactivate any previous sessions to the niagara endpoint (in the last 7 days), and activate/login again.

For more details check https://docs.scinet.utoronto.ca/index.php/Globus#computecandada.23niagara

Jan 21, 2021: Globus access disruption on Fri, Jan/22/2021 10AM: Please be advised that we will have a maintenance window starting tomorrow at 10AM to roll out the transition of services to oauth based authentication.

Jan 15, 2021: Globus access update on Mon, Jan/18/2021 and Tue, Jan/19/2021: Please be advised that we will start preparations on Monday to perform an update to Globus access on Tuesday. We'll be adopting oauth instead of myproxy from that point on. During this period expect sporadic disruptions of service. On Monday we'll already block access to nia-dm2, so please refrain from starting new login sessions or ssh tunnels via nia-dm2 from this weekend already.

December 11, 2020, 12:00 AM EST: Cooling issue resolved. Systems back.

December 11, 2020, 6:00 PM EST: Cooling issue at datacenter. All systems down.

December 7, 2020, 7:25 PM EST: All systems back; users can log in again.

December 7, 2020, 6:46 PM EST: User connectivity to data center not yet ready, but queued jobs on Mist and Niagara have been started.

December 7, 2020, 7:00 AM EST: Maintenance shutdown in effect. This is a one-day maintenance shutdown. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online this evening.

December 2, 2020, 9:10 PM EST: Power is back, systems are coming up. Please resubmit any jobs that failed because of this incident.

December 2, 2020, 6:00 PM EST: Power glitch at the data center, caused about half of the compute nodes to go down. Power issue not yet resolved.

Announcing a Maintenance Shutdown on December 7th, 2020
There will be a one-day maintenance shutdown on December 7th 2020, starting at 7 am EST. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time. We expect to be able to bring the systems back online in the evening of the same day.

November 6, 2020, 8:00 PM EST: Systems are coming back online.

November 6, 2020, 9:49 AM EST: Repairs on the cooling system are underway. No ETA, but the systems will likely be back some time today.

November 6, 2020, 4:27 AM EST: Cooling system failure, datacentre is shut down.

October 9, 2020, 12:57 PM: A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.

October 8, 2020, 9:50 PM: Jupyterhub service is back up.

October 8, 2020, 5:40 PM: Jupyterhub service is down. We are investigating.

September 28, 2020, 11:00 AM EST: A short power glitch caused many of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.

September 1, 2020, 2:15 PM EST: A short power glitch caused about half of the Niagara compute nodes to lose power; jobs running on them would have failed. Please check your jobs and resubmit.

September 1, 2020, 9:27 AM EST: The Niagara cluster has moved to a new default software stack, NiaEnv/2019b. If your job scripts used the previous default software stack before (NiaEnv/2018a), please put the command "module load NiaEnv/2018a" before other module commands in those scripts, to ensure they will continue to work, or try the new stack (recommended).

August 24, 2020, 7:37 PM EST: Connectivity is back to normal.
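For the NiaEnv/2019b change announced on September 1, 2020, one possible way to apply the suggested fix to an existing job script is sketched below. The helper name pin_old_stack is hypothetical and not a SciNet tool; it simply prints a copy of the script with "module load NiaEnv/2018a" inserted before the first line that begins with a module command. It assumes the script does not already load NiaEnv itself.

```shell
# Print the job script "$1" with the old default stack loaded before
# the first module command. The original file is left untouched.
pin_old_stack() {
    awk '!seen && /^[ \t]*module[ \t]/ { print "module load NiaEnv/2018a"; seen = 1 }
         { print }' "$1"
}
```

A typical use would be `pin_old_stack myjob.sh > myjob_2018a.sh`, followed by a quick look at the result before submitting it. Scripts without any module commands are printed unchanged.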

August 24, 2020, 6:35 PM EST: We have partial connectivity back, but are still investigating.

August 24, 2020, 3:15 PM EST: There are issues connecting to the data centre. We're investigating.

August 21, 2020, 6:00 PM EST: The pump has been repaired, cooling is restored, systems are up.
Scratch purging is postponed until the evening of Friday Aug 28th, 2020.

August 19, 2020, 4:40 PM EST: Update: The current estimate is to have the cooling restored on Friday and we hope to have the systems available for users on Saturday August 22, 2020.

August 17, 2020, 4:00 PM EST: Unfortunately after taking the pump apart it was determined there was a more serious failure of the main drive shaft, not just the seal. As a new one will need to be sourced or fabricated we're estimating that it will take at least a few more days to get the part and repairs done to restore cooling. Sorry for the inconvenience. 

August 15, 2020, 1:00 PM EST: Because of limited availability of the parts needed to repair the failed pump and cooling system, it is unlikely that systems can be restored before Monday afternoon at the earliest.

August 15, 2020, 12:04 AM EST: A primary pump seal in the cooling infrastructure has blown, and parts availability cannot be determined until tomorrow. All systems are shut down as there is no cooling. If parts are available, systems may be back at the earliest late tomorrow. Check here for updates.

August 14, 2020, 9:04 PM EST: Tomorrow's /scratch purge has been postponed.

August 14, 2020, 9:00 PM EST: Staff at the datacenter. Looks like one of the pumps has a seal that is leaking badly.

August 14, 2020, 8:37 PM EST: We seem to be undergoing a thermal shutdown at the datacenter.

August 14, 2020, 8:20 PM EST: Network problems to niagara/mist. We are investigating.

August 13, 2020, 10:40 AM EST: Network is fixed, scheduler and other services are back.

August 13, 2020, 8:20 AM EST: We had an IB switch failure, which is affecting a subset of nodes, including the scheduler nodes.

August 10, 2020, 7:30 PM EST: Scheduler fully operational again.

August 10, 2020, 3:00 PM EST: Scheduler partially functional: jobs can be submitted and are running.

August 10, 2020, 2:00 PM EST: Scheduler is temporarily not operational.

August 7, 2020, 9:15 PM EST: Network is fixed, scheduler and other services are coming back.

August 7, 2020, 8:20 PM EST: Disruption of part of the network in the data centre. This is causing issues with the scheduler, the Mist login node, and possibly others. We are investigating.

July 30, 2020, 9:00 AM: Project backup in progress but incomplete. Please be aware that after we deployed the new, larger storage appliance for scratch and project two months ago, we started a full backup of project (1.5PB). This backup is taking a while to complete, and there are still a few areas which have not been backed up fully. Please be careful not to delete things from project that you still need, in particular if they were added recently.

July 27, 2020, 5:00 PM: Scheduler issues resolved.

July 27, 2020, 3:00 PM: Scheduler issues. We are investigating.

July 13, 4:40 PM: Most systems are available again. Only Mist is still being brought up.

July 13, 10:00 AM: SciNet/Niagara Downtime In Progress

SciNet/Niagara Downtime Announcement, July 13, 2020
All resources at SciNet will undergo a maintenance shutdown on Monday July 13, 2020, starting at 10:00 am EDT, for file system and scheduler upgrades. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time. We expect to be able to bring the systems back around 3 PM (EST) on the same day.

June 29, 6:21:00 PM: Systems are available again.

June 29, 12:30:00 PM: Power Outage caused thermal shutdown.

June 20, 2020, 10:24 PM: File systems are back up. Unfortunately, all running jobs would have died and users are asked to resubmit them.

June 20, 2020, 9:48 PM: An issue with the file systems is causing trouble. We are investigating the cause.

June 15, 2020, 10:30 PM: A power glitch caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

June 12, 2020, 6:15 PM: Two power glitches during the night caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

June 6, 2020, 6:06 AM: A power glitch caused some compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

May 24, 2020, 8:20 AM: A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

May 7, 2020, 6:05 PM: Maintenance shutdown is finished. Most systems are back in production.

May 6, 2020, 7:08 AM: Two-day datacentre maintenance shutdown has started.

SciNet/Niagara Downtime Announcement, May 6-7, 2020

All resources at SciNet will undergo a two-day maintenance shutdown on May 6th and 7th 2020, starting at 7 am EDT on Wednesday May 6th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) or systems hosted at the SciNet data centre. We expect to be able to bring the systems back online the evening of May 7th.

May 4, 2020, 7:51 AM: A power glitch this morning caused compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

May 3, 2020, 8:20 AM: A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time may have failed; users are asked to resubmit these jobs.

April 28, 2020, 7:20 AM: A power glitch this morning caused all compute nodes to be rebooted: jobs running at the time have failed; users are asked to resubmit these jobs.

April 20, 2020: Security Incident at Cedar; implications for Niagara users

Last week, it became evident that the Cedar GP cluster had been compromised for several weeks. The passwords of at least two Compute Canada users were known to the attackers. One of these was used to escalate privileges on Cedar, as explained on https://status.computecanada.ca/view_incident?incident=423.

These accounts were used to log in to Niagara as well, but Niagara did not have the same security loophole as Cedar (which has been fixed), and no further escalation was observed on Niagara.

Reassuring as that may sound, it is not known how the passwords of the two user accounts were obtained. Given this uncertainty, the SciNet team *strongly* recommends that you change your password on https://ccdb.computecanada.ca/security/change_password, and remove any SSH keys and regenerate new ones (see https://docs.scinet.utoronto.ca/index.php/SSH_keys).

Tue 30 Mar 2020 14:55:14 EDT: Burst Buffer is available again.

Fri Mar 27 15:29:00 EDT 2020: SciNet systems are back up. Only the Burst Buffer remains offline; its maintenance is expected to be finished early next week.

Thu Mar 26 23:05:00 EDT 2020: Some aspects of the maintenance took longer than expected. The systems will not be back up until some time tomorrow, Friday March 27, 2020.

Wed Mar 25 7:00:00 EDT 2020: SciNet/Niagara downtime started.

Mon Mar 23 18:45:10 EDT 2020: File system issues were resolved.

Mon Mar 23 18:01:19 EDT 2020: There is currently an issue with the main Niagara filesystems. This affects all systems; all jobs have been killed. The issue is being investigated.

Fri Mar 20 13:15:33 EDT 2020: There was a power glitch at the datacentre at 8:50 AM, which resulted in jobs getting killed. Please resubmit failed jobs.

COVID-19 Impact on SciNet Operations, March 18, 2020

Although the University of Toronto is closing some of its research operations on Friday March 20 at 5 pm EDT, this does not affect the SciNet systems (such as Niagara, Mist, and HPSS), which will remain operational.

SciNet/Niagara Downtime Announcement, March 25-26, 2020

All resources at SciNet will undergo a two-day maintenance shutdown on March 25th and 26th 2020, starting at 7 am EDT on Wednesday March 25th. There will be no access to any of the SciNet systems (Niagara, Mist, HPSS, Teach cluster, or the file systems) during this time.

This shutdown is necessary to finish the expansion of the Niagara cluster and its storage system.

We expect to be able to bring the systems back online the evening of March 26th.

March 9, 2020, 11:24 PM: HPSS services are temporarily suspended for emergency maintenance.

March 7, 2020, 10:15 PM: File system issues have been cleared.

March 6, 2020, 7:30 PM: File system issues; we are investigating.

March 2, 2020, 1:30 PM: For the extension of Niagara, the operating system on all Niagara nodes has been upgraded from CentOS 7.4 to 7.6. This required all nodes to be rebooted. Running compute jobs are allowed to finish before the compute node gets rebooted. Login nodes have all been rebooted, as have the datamover nodes and the jupyterhub service.

Feb 24, 2020, 1:30PM: The Mist login node got rebooted. It is back, but we are still monitoring the situation.

Feb 12, 2020, 11:00AM: The Mist GPU cluster is now available to users.

Feb 11, 2020, 2:00PM: The Niagara compute nodes were accidentally rebooted, killing all running jobs.

Feb 10, 2020, 7:00PM: HPSS is back to normal.

Jan 30, 2020, 12:01PM: We are having an issue with HPSS, in which the disk-cache is full. We put a reservation on the whole system (Globus, plus archive and vfs queues), until it has had a chance to clear some space on the cache.

Jan 21, 2020, 4:05PM: There was a partial power outage that took down a large number of the compute nodes. If your job died during this period, please resubmit it.

Jan 13, 2020, 7:35 PM: Maintenance finished.

Jan 13, 2020, 8:20 AM: The announced maintenance downtime started (see below).

Jan 9 2020, 11:30 AM: External ssh connectivity restored, issue related to the university network.

Jan 9 2020, 9:24 AM: We received reports of users having trouble connecting into the SciNet data centre; we're investigating. Systems are up and running and jobs are fine.

As a workaround in the meantime, it appears to be possible to log into graham, cedar or beluga, and then ssh to niagara.

Downtime announcement: To prepare for the upcoming expansion of Niagara, there will be a one-day maintenance shutdown on January 13th 2020, starting at 8 am EST. There will be no access to Niagara, Mist, HPSS or teach, nor to their file systems during this time.

2019

December 13, 9:00 AM EST: Issues resolved.

December 13, 8:20 AM EST: An overnight issue is now preventing logins to Niagara and other services. It is possibly a file system issue; we are investigating.

Fri, Nov 15 2019, 11:00 PM (EST) Niagara and most of the main systems are now available.

Fri, Nov 15 2019, 7:50 PM (EST) SOSCIP GPU cluster is up and accessible. Work on the other systems continues.

Fri, Nov 15 2019, 5:00 PM (EST) Infrastructure maintenance done, upgrades still in process.

Fri, Nov 15 2019, 7:00 AM (EST) Maintenance shutdown of the SciNet data centre has started. Note: scratch purging has been postponed until Nov 17.

Announcement: The SciNet datacentre will undergo a maintenance shutdown on Friday November 15th 2019, from 7 am to 11 pm (EST), with no access to any of the SciNet systems (Niagara, P8, SGC, HPSS, Teach cluster, or the filesystems) during that time.

Sat, Nov 2 2019, 1:30 PM (update): Chiller has been fixed, all systems are operational.

Fri, Nov 1 2019, 4:30 PM (update): We are operating in free cooling so have brought up about 1/2 of the Niagara compute nodes to reduce the cooling load. Access, storage, and other systems should now be available.

Fri, Nov 1 2019, 12:05 PM (update): A power module in the chiller has failed and needs to be replaced. We should be able to operate in free cooling if the temperature stays cold enough, but we may not be able to run all systems. No eta yet on when users will be able to log back in.

Fri, Nov 1 2019, 9:15 AM (update): There was an automated shutdown because of rising temperatures, causing all systems to go down. We are investigating; check here for updates.

Fri, Nov 1 2019, 8:16 AM: Unexpected data centre issue: Check here for updates.

Thu 1 Aug 2019 5:00:00 PM: Systems are up and operational.

Thu 1 Aug 2019 7:00:00 AM: Scheduled Downtime Maintenance of the SciNet Datacenter. All systems will be down and unavailable starting 7am until the evening.

Fri 26 Jul 2019, 16:02:26 EDT: There was an issue with the Burst Buffer at around 3PM, and it was recently solved. BB is OK again.

Sun 30 Jun 2019 The SOSCIP BGQ and P7 systems were decommissioned on June 30th, 2019. The BGQdev front end node and storage are still available.

Wed 19 Jun 2019, 1:20:00 PM: The BGQ is back online.

Wed 19 Jun 2019, 10:00:00 AM: The BGQ is still down, the SOSCIP GPU nodes should be back up.

Wed 19 Jun 2019, 1:40:00 AM: There was an issue with the SOSCIP BGQ and GPU Cluster last night about 1:42am, probably a power fluctuation that took it down.

Wed 12 Jun 2019, 3:30 AM - 7:40 AM Intermittent system issues on Niagara's project and scratch as the file number limit was reached. We increased the number of files allowed in total on the file system.

Thu 30 May 2019, 11:00:00 PM: The maintenance downtime of SciNet's data center has finished, and systems are being brought online now. You can check the progress here. Some systems might not be available until Friday morning.
Some action on the part of users will be required when they first connect again to the Niagara login nodes or datamovers. This is due to the security upgrade of the Niagara cluster, which is now in line with currently accepted best practices.
The details of the required actions can be found on the SSH Changes in May 2019 wiki page.

Wed 29-30 May 2019 The SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.

SCHEDULED SHUTDOWN:

Please be advised that on Wednesday May 29th through Thursday May 30th, the SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.

This is necessary to finish the installation of an emergency power generator, to perform the annual cooling tower maintenance, and to enhance login security.

We expect to be able to bring the systems back online the evening of May 30th. Due to the enhanced login security, users will need to update the known hosts list of their ssh applications. More detailed information on this procedure will be sent shortly before the systems are back online.

Fri 5 Apr 2019: Software updates on Niagara: The default CCEnv software stack now uses avx512 on Niagara, and there is now a NiaEnv/2019b stack ("epoch").

Thu 4 Apr 2019: The 2019 compute and storage allocations have taken effect on Niagara.

NOTE: There is scheduled network maintenance for Friday April 26th 12am-8am on the SciNet datacenter external network connection. This will not affect internal connections and running jobs; however, remote connections may see interruptions during this period.

Wed 24 Apr 2019 14:14 EDT: HPSS is back in service. Library and robot arm maintenance finished.

Wed 24 Apr 2019 08:35 EDT: HPSS out of service this morning for library and robot arm maintenance.

Fri 19 Apr 2019 17:40 EDT: HPSS robot arm has been released and is back to normal operations.

Fri 19 Apr 2019 14:00 EDT: Problems with the HPSS library robot have been detected.

Wed 17 Apr 2019 15:35 EDT: Network connection is back.

Wed 17 Apr 2019 15:12 EDT: Network connection down. Investigating.

Tue 9 Apr 2019 22:24:14 EDT: Network connection restored.

Tue 9 Apr 2019, 15:20: Network connection down. Investigating.

Fri 5 Apr 2019: Planned, short outage in connectivity to the SciNet datacentre from 7:30 am to 8:55 am EST for maintenance of the network. This outage will not affect running or queued jobs. It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.

April 4, 2019: The 2019 compute and storage allocations will take effect on Niagara. Running jobs will not be affected by this change and will run their course. Queued jobs' priorities will be updated to reflect the new fairshare values later in the day. The queue should fully reflect the new fairshare values in about 24 hours.

It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.

There will be updates to the software stack on this day as well.

March 25, 3:05 PM EST: Most systems back online, other services should be back shortly.

March 25, 12:05 PM EST: Power is back at the datacentre, but it is not yet known when all systems will be back up. Keep checking here for updates.

March 25, 11:27 AM EST: A power outage in the datacentre occurred and caused all services to go down. Check here for updates.

Thu Mar 21 10:37:28 EDT 2019: HPSS is back in service

HPSS out of service on Tue, Mar/19 at 9AM, for tape library expansion and relocation. It's possible the downtime will extend to Wed, Mar/20.

January 21, 4:00 PM: HPSS is back in service. Thank you for your patience.

January 18, 5:00 PM: We did practically all of the HPSS upgrades (software/hardware); however, the main client node, archive02, is presenting an issue that we have not been able to resolve yet. We will try to resume work over the weekend with cool heads, or on Monday. Sorry, but this is an unforeseen delay. Jobs in the queue will remain there, and we'll delay the scratch purging by 1 week.

January 16, 11:00 PM: HPSS is being upgraded, as announced.

January 16, 8:00 PM: Systems are coming back up and should be accessible for users now.

January 15, 8:00 AM: Data centre downtime in effect.

Downtime Announcement for January 15 and 16, 2019
The SciNet datacentre will need to undergo a two-day maintenance shutdown in order to perform electrical work, repairs and maintenance. The electrical work is in preparation for the upcoming installation of an emergency power generator and a larger UPS, which will result in increased resilience to power glitches and outages. The shutdown is scheduled to start on Tuesday January 15, 2019, at 7 am and will last until Wednesday January 16, 2019, some time in the evening. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the filesystems) during this time. Check back here for up-to-date information on the status of the systems.

Note: this downtime was originally scheduled for Dec. 18, 2018, but has been postponed and combined with the annual maintenance downtime.

December 24, 2018, 11:35 AM EST: Most systems are operational again. If you had compute jobs running yesterday at around 3:30PM, they likely crashed - please check them and resubmit if needed.

December 24, 2018, 10:40 AM EST: Repairs have been made, and the file systems are starting to be mounted on the cluster.

December 23, 2018, 3:38 PM EST: Issues with the file systems (home, scratch and project). We are investigating, it looks like a hardware issue that we are trying to work around. Note that the absence of /home means you cannot log in with ssh keys. All compute jobs crashed around 3:30 PM EST on Dec 23. Once the system is properly up again, please resubmit your jobs. Unfortunately, at this time of year, it is not possible to give an estimate on when the system will be operational again.

Thu Nov 22 14:20:00 EDT 2018: HPSS back in service

Thu Nov 22 08:55:00 EDT 2018: HPSS offline for scheduled maintenance

Tue Nov 20 16:30:00 EDT 2018: HPSS offline on Thursday 9AM for installation of new LTO8 drives in the tape library.

Tue Oct 9 12:16:00 EDT 2018: BGQ compute nodes are up.

Sun Oct 7 20:24:26 EDT 2018: SGC and BGQ front end are available; BGQ compute nodes are down due to a cooling issue.

Sat Oct 6 23:16:44 EDT 2018: There were some problems bringing up SGC & BGQ; they will remain offline for now.

Sat Oct 6 18:36:35 EDT 2018: Electrical work finished, power restored. Systems are coming online.

July 18, 2018: login.scinet.utoronto.ca is now disabled, GPC $SCRATCH and $HOME are decommissioned.

July 12, 2018: There was a short power interruption around 10:30 am which caused most of the systems (Niagara, SGC, BGQ) to reboot and any running jobs to fail.

July 11, 2018: P7's moved to BGQ filesystem, P8's moved to Niagara filesystem.

May 24, 2018, 9:25 PM EST: The data center is up, and all systems are operational again.

May 24, 2018, 7:00 AM EST: The data centre is under annual maintenance. All systems are offline. Systems are expected to be back late afternoon today; check for updates on this page.

May 18, 2018: Announcement: Annual scheduled maintenance downtime: Thursday May 24, starting 7:00 AM

May 16, 2018: Cooling restored, systems online

May 16, 2018: Cooling issue at datacentre again, all systems down

May 15, 2018: Cooling restored, systems coming online

May 15, 2018: Cooling issue at datacentre, all systems down

May 4, 2018: HPSS is now operational on Niagara.

May 3, 2018: Burst Buffer is available upon request.

May 3, 2018: The Globus endpoint for Niagara is available: computecanada#niagara.

May 1, 2018: System status moved here.

Apr 23, 2018: GPC-compute is decommissioned, GPC-storage available until 30 May 2018.

April 10, 2018: Niagara commissioned.