Previous messages

From SciNet Users Documentation
Revision as of 14:28, 24 July 2019 by Northrup (talk | contribs)

Sun 30 Jun 2019 The SOSCIP BGQ and P7 systems were decommissioned on June 30th, 2019. The BGQdev front end node and storage are still available.

Wed 19 Jun 2019, 1:20:00 PM: The BGQ is back online.

Wed 19 Jun 2019, 10:00:00 AM: The BGQ is still down; the SOSCIP GPU nodes should be back up.

Wed 19 Jun 2019, 1:40:00 AM: There was an issue with the SOSCIP BGQ and GPU cluster last night at about 1:42 am, probably a power fluctuation that took them down.

Wed 12 Jun 2019, 3:30 AM - 7:40 AM Intermittent system issues on Niagara's project and scratch as the file number limit was reached. We increased the number of files allowed in total on the file system.

Thu 30 May 2019, 11:00:00 PM: The maintenance downtime of SciNet's data center has finished, and systems are being brought online now. You can check the progress here. Some systems might not be available until Friday morning.
Some action on the part of users will be required when they first connect again to a Niagara login node or datamover. This is due to the security upgrade of the Niagara cluster, which is now in line with currently accepted best practices.
The details of the required actions can be found on the SSH Changes in May 2019 wiki page.

Wed 29-30 May 2019 The SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.

SCHEDULED SHUTDOWN:

Please be advised that on Wednesday May 29th through Thursday May 30th, the SciNet datacentre will undergo a two-day maintenance shutdown, starting at 7 am EDT on Wednesday May 29th. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the file systems) during this time.

This is necessary to finish the installation of an emergency power generator, to perform the annual cooling tower maintenance, and to enhance login security.

We expect to be able to bring the systems back online on the evening of May 30th. Due to the enhanced login security, users will need to update the known hosts list of their ssh applications. More detailed information on this procedure will be sent shortly before the systems are back online.
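
As an illustration of what updating a known hosts list can involve, the following is a minimal Python sketch, not the official procedure (which is the one described on the SSH Changes in May 2019 wiki page). It assumes the usual OpenSSH known_hosts layout with plain-text (unhashed) host names; the function name drop_host_entries and the host names in the usage note are hypothetical. In practice, the standard OpenSSH command ssh-keygen -R hostname performs this job and also handles hashed entries.

```python
def drop_host_entries(known_hosts_text: str, hostname: str) -> str:
    """Return known_hosts content without the entries listed for `hostname`.

    Assumption: host names are stored in plain text. Hashed entries
    (lines starting with "|1|") are not recognized and are kept as-is.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        fields = stripped.split()
        # An optional marker (@cert-authority or @revoked) may precede the hosts field.
        if fields[0].startswith("@") and len(fields) > 1:
            hosts_field = fields[1]
        else:
            hosts_field = fields[0]
        # The hosts field is a comma-separated list of names and addresses.
        if hostname in hosts_field.split(","):
            continue  # drop the stale entry for this host
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

For example, drop_host_entries(text, "niagara.example.org") returns the content of text with every line that lists the placeholder host niagara.example.org removed. The next ssh connection would then ask to confirm the new host key, which should be checked against the fingerprints published by the site.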

Fri 5 Apr 2019: Software updates on Niagara: The default CCEnv software stack now uses avx512 on Niagara, and there is now a NiaEnv/2019b stack ("epoch").

Thu 4 Apr 2019: The 2019 compute and storage allocations have taken effect on Niagara.

NOTE: Network maintenance is scheduled for Friday April 26th, 12am-8am, on the SciNet datacenter external network connection. This will not affect internal connections or running jobs; however, remote connections may see interruptions during this period.


Wed 24 Apr 2019 14:14 EDT: HPSS is back in service. Library and robot arm maintenance finished.

Wed 24 Apr 2019 08:35 EDT: HPSS out of service this morning for library and robot arm maintenance.

Fri 19 Apr 2019 17:40 EDT: HPSS robot arm has been released and is back to normal operations.

Fri 19 Apr 2019 14:00 EDT: Problems with the HPSS library robot have been detected.

Wed 17 Apr 2019 15:35 EDT: Network connection is back.

Wed 17 Apr 2019 15:12 EDT: Network connection down. Investigating.

Tue 9 Apr 2019 22:24:14 EDT: Network connection restored.

Tue 9 Apr 2019, 15:20: Network connection down. Investigating.

Fri 5 Apr 2019: Planned, short outage in connectivity to the SciNet datacentre from 7:30 am to 8:55 am EST for maintenance of the network. This outage will not affect running or queued jobs. It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.


April 4, 2019: The 2019 compute and storage allocations will take effect on Niagara. Running jobs will not be affected by this change and will run their course. Queued jobs' priorities will be updated to reflect the new fairshare values later in the day. The queue should fully reflect the new fairshare values in about 24 hours.

It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.

There will be updates to the software stack on this day as well.

March 25, 3:05 PM EST: Most systems back online, other services should be back shortly.

March 25, 12:05 PM EST: Power is back at the datacentre, but it is not yet known when all systems will be back up. Keep checking here for updates.

March 25, 11:27 AM EST: A power outage occurred in the datacentre and caused all services to go down. Check here for updates.

Thu Mar 21 10:37:28 EDT 2019: HPSS is back in service

HPSS out of service on Tue, Mar/19 at 9AM, for tape library expansion and relocation. It's possible the downtime will extend to Wed, Mar/20.

January 21, 4:00 PM: HPSS is back in service. Thank you for your patience.

January 18, 5:00 PM: We completed practically all of the HPSS upgrades (software and hardware); however, the main client node, archive02, is presenting an issue that we have not yet been able to resolve. We will try to resume work over the weekend with cool heads, or on Monday. Sorry, but this is an unforeseen delay. Jobs in the queue will remain there, and we will delay the scratch purging by 1 week.

January 16, 11:00 PM: HPSS is being upgraded, as announced.

January 16, 8:00 PM: Systems are coming back up and should be accessible for users now.

January 15, 8:00 AM: Data centre downtime in effect.

  • Downtime Announcement for January 15 and 16, 2019

The SciNet datacentre will need to undergo a two-day maintenance shutdown in order to perform electrical work, repairs and maintenance. The electrical work is in preparation for the upcoming installation of an emergency power generator and a larger UPS, which will result in increased resilience to power glitches and outages. The shutdown is scheduled to start on Tuesday January 15, 2019, at 7 am and will last until Wednesday January 16, 2019, some time in the evening. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the filesystems) during this time. Check back here for up-to-date information on the status of the systems.

Note: this downtime was originally scheduled for Dec. 18, 2018, but has been postponed and combined with the annual maintenance downtime.

  • December 24, 2018, 11:35 AM EST: Most systems are operational again. If you had compute jobs running yesterday at around 3:30PM, they likely crashed - please check them and resubmit if needed.
  • December 24, 2018, 10:40 AM EST: Repairs have been made, and the file systems are starting to be mounted on the cluster.
  • December 23, 2018, 3:38 PM EST: Issues with the file systems (home, scratch and project). We are investigating; it looks like a hardware issue that we are trying to work around. Note that the absence of /home means you cannot log in with ssh keys. All compute jobs crashed around 3:30 PM EST on Dec 23. Once the system is properly up again, please resubmit your jobs. Unfortunately, at this time of year, it is not possible to give an estimate of when the system will be operational again.
  • Thu Nov 22 14:20:00 EDT 2018: HPSS back in service
  • Thu Nov 22 08:55:00 EDT 2018: HPSS offline for scheduled maintenance
  • Tue Nov 20 16:30:00 EDT 2018: HPSS offline on Thursday 9AM for installation of new LTO8 drives in the tape library.
  • Tue Oct 9 12:16:00 EDT 2018: BGQ compute nodes are up.
  • Sun Oct 7 20:24:26 EDT 2018: SGC and BGQ front end are available; BGQ compute nodes are down due to a cooling issue.
  • Sat Oct 6 23:16:44 EDT 2018: There were some problems bringing up SGC & BGQ; they will remain offline for now.
  • Sat Oct 6 18:36:35 EDT 2018: Electrical work finished, power restored. Systems are coming online.
  • July 18, 2018: login.scinet.utoronto.ca is now disabled, GPC $SCRATCH and $HOME are decommissioned.
  • July 12, 2018: There was a short power interruption around 10:30 am which caused most of the systems (Niagara, SGC, BGQ) to reboot and any running jobs to fail.
  • July 11, 2018: P7s moved to the BGQ filesystem; P8s moved to the Niagara filesystem.
  • May 24, 2018, 9:25 PM EST: The data center is up, and all systems are operational again.
  • May 24, 2018, 7:00 AM EST: The data centre is under annual maintenance. All systems are offline. Systems are expected to be back late afternoon today; check for updates on this page.
  • May 18, 2018: Announcement: Annual scheduled maintenance downtime: Thursday May 24, starting 7:00 AM
  • May 16, 2018: Cooling restored, systems online
  • May 16, 2018: Cooling issue at datacentre again, all systems down
  • May 15, 2018: Cooling restored, systems coming online
  • May 15, 2018: Cooling issue at datacentre, all systems down
  • May 4, 2018: HPSS is now operational on Niagara.
  • May 3, 2018: Burst Buffer is available upon request.
  • May 3, 2018: The Globus endpoint for Niagara is available: computecanada#niagara.
  • May 1, 2018: System status moved here.
  • Apr 23, 2018: GPC-compute is decommissioned; GPC-storage is available until 30 May 2018.
  • April 10, 2018: Niagara commissioned.