Difference between revisions of "Previous messages"

From SciNet Users Documentation

Revision as of 11:58, 26 April 2019

NOTE: There is scheduled network maintenance for Friday April 26th 12am-8am on the SciNet datacentre external network connection. This will not affect internal connections or running jobs; however, remote connections may see interruptions during this period.


Wed 24 Apr 2019 14:14 EDT: HPSS is back in service. Library and robot arm maintenance finished.

Wed 24 Apr 2019 08:35 EDT: HPSS out of service this morning for library and robot arm maintenance.

Fri 19 Apr 2019 17:40 EDT: HPSS robot arm has been released and is back to normal operations.

Fri 19 Apr 2019 14:00 EDT: Problems with the HPSS library robot have been detected.

Wed 17 Apr 2019 15:35 EDT: Network connection is back.

Wed 17 Apr 2019 15:12 EDT: Network connection down. Investigating.

Tue 9 Apr 2019 22:24:14 EDT: Network connection restored.

Tue 9 Apr 2019, 15:20: Network connection down. Investigating.

April 4, 2019: The 2019 compute and storage allocations will take effect on Niagara. Running jobs will not be affected by this change and will run their course. Queued jobs' priorities will be updated to reflect the new fairshare values later in the day. The queue should fully reflect the new fairshare values in about 24 hours.

It may be necessary to reboot the login nodes at some point tomorrow, which could result in a short interruption of connectivity, but which will have no effect on running or queued jobs.

There will be updates to the software stack on this day as well.

March 25, 3:05 PM EST: Most systems back online, other services should be back shortly.

March 25, 12:05 PM EST: Power is back at the datacentre, but it is not yet known when all systems will be back up. Keep checking here for updates.

March 25, 11:27 AM EST: A power outage occurred in the datacentre and caused all services to go down. Check here for updates.

Thu Mar 21 10:37:28 EDT 2019: HPSS is back in service

HPSS out of service on Tue, Mar/19 at 9AM, for tape library expansion and relocation. It's possible the downtime will extend to Wed, Mar/20.

January 21, 4:00 PM: HPSS is back in service. Thank you for your patience.

January 18, 5:00 PM: We completed practically all of the HPSS upgrades (software/hardware); however, the main client node (archive02) has an issue that we have not yet been able to resolve. We will try to resume work over the weekend with cool heads, or on Monday. Sorry, but this is an unforeseen delay. Jobs in the queue will remain there, and we'll delay the scratch purging by 1 week.

January 16, 11:00 PM: HPSS is being upgraded, as announced.

January 16, 8:00 PM: Systems are coming back up and should be accessible to users now.

January 15, 8:00 AM: Data centre downtime in effect.

  • Downtime Announcement for January 15 and 16, 2019

The SciNet datacentre will need to undergo a two-day maintenance shutdown in order to perform electrical work, repairs and maintenance. The electrical work is in preparation for the upcoming installation of an emergency power generator and a larger UPS, which will result in increased resilience to power glitches and outages. The shutdown is scheduled to start on Tuesday January 15, 2019, at 7 am and will last until Wednesday January 16, 2019, some time in the evening. There will be no access to any of the SciNet systems (Niagara, P7, P8, BGQ, SGC, HPSS, Teach cluster, or the filesystems) during this time. Check back here for up-to-date information on the status of the systems.

Note: this downtime was originally scheduled for Dec. 18, 2018, but has been postponed and combined with the annual maintenance downtime.

  • December 24, 2018, 11:35 AM EST: Most systems are operational again. If you had compute jobs running yesterday at around 3:30PM, they likely crashed - please check them and resubmit if needed.
  • December 24, 2018, 10:40 AM EST: Repairs have been made, and the file systems are starting to be mounted on the cluster.
  • December 23, 2018, 3:38 PM EST: Issues with the file systems (home, scratch and project). We are investigating; it looks like a hardware issue that we are trying to work around. Note that the absence of /home means you cannot log in with ssh keys. All compute jobs crashed around 3:30 PM EST on Dec 23. Once the system is properly up again, please resubmit your jobs. Unfortunately, at this time of year, it is not possible to give an estimate of when the system will be operational again.
  • Thu Nov 22 14:20:00 EDT 2018: HPSS back in service
  • Thu Nov 22 08:55:00 EDT 2018: HPSS offline for scheduled maintenance
  • Tue Nov 20 16:30:00 EDT 2018: HPSS offline on Thursday 9AM for installation of new LTO8 drives in the tape library.
  • Tue Oct 9 12:16:00 EDT 2018: BGQ compute nodes are up.
  • Sun Oct 7 20:24:26 EDT 2018: SGC and BGQ front end are available; BGQ compute nodes are down due to a cooling issue.
  • Sat Oct 6 23:16:44 EDT 2018: There were some problems bringing up SGC & BGQ; they will remain offline for now.
  • Sat Oct 6 18:36:35 EDT 2018: Electrical work finished, power restored. Systems are coming online.
  • July 18, 2018: login.scinet.utoronto.ca is now disabled; GPC $SCRATCH and $HOME are decommissioned.
  • July 12, 2018: There was a short power interruption around 10:30 am which caused most of the systems (Niagara, SGC, BGQ) to reboot and any running jobs to fail.
  • July 11, 2018: P7s moved to the BGQ filesystem; P8s moved to the Niagara filesystem.
  • May 24, 2018, 9:25 PM EST: The data centre is up, and all systems are operational again.
  • May 24, 2018, 7:00 AM EST: The data centre is under annual maintenance. All systems are offline. Systems are expected to be back late afternoon today; check for updates on this page.
  • May 18, 2018: Announcement: Annual scheduled maintenance downtime: Thursday May 24, starting 7:00 AM
  • May 16, 2018: Cooling restored, systems online
  • May 16, 2018: Cooling issue at datacentre again, all systems down
  • May 15, 2018: Cooling restored, systems coming online
  • May 15, 2018: Cooling issue at datacentre, all systems down
  • May 4, 2018: HPSS is now operational on Niagara.
  • May 3, 2018: Burst Buffer is available upon request.
  • May 3, 2018: The Globus endpoint for Niagara is available: computecanada#niagara.
  • May 1, 2018: System status moved here.
  • Apr 23, 2018: GPC-compute is decommissioned; GPC-storage available until 30 May 2018.
  • April 10, 2018: Niagara commissioned.