Main Page

From SciNet Users Documentation
Revision as of 22:08, 26 June 2026 by Willis2 (talk | contribs)

System Status

Trillium OnDemand Globus
HPSS Balam S4H
Teach File system External Network

Fri Jun 26, 2026, 5:00 PM: Most systems are being brought back up.

Fri Jun 26, 2026, 4:00 PM: We upgraded Globus and HPSS to the latest software stack:

  • alliancecan#trillium endpoint is working fine, no issues
  • alliancecan#hpss: we have detected a bug that leaves this endpoint only partly working:
    • It ingests only the first file of a sequence (sometimes the first two) and nothing after that.
    • It creates the whole directory tree but transfers only one file; the rest of the tree is left empty.
    • Recalling multiple files or whole trees out of HPSS works fine (this is the good half).
    • We have identified a temporary workaround for users who need to transfer data before a permanent fix is available:
      • Go to "Transfer & Timer Options" in the middle of the file transfer pane.
      • Check the box labelled "do NOT verify file integrity after transfer" (fourth checkbox down).
      • You will then be able to ingest multiple files and whole trees.
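
For users who drive Globus from the command line rather than the web interface, the same workaround can probably be expressed with the Globus CLI. The sketch below only builds and prints a `globus transfer` command; it assumes that the CLI's `--no-verify-checksum` option corresponds to the "do NOT verify file integrity after transfer" checkbox, which has not been confirmed for this bug. The collection UUIDs and paths are placeholders (look up the real IDs yourself, for example with `globus endpoint search`), and the helper function name is made up for illustration.

```python
import shlex


def build_globus_transfer_command(source_id, source_path, dest_id, dest_path,
                                  recursive=True):
    """Build a `globus transfer` command that skips checksum verification.

    All IDs and paths are supplied by the caller; nothing here contacts Globus.
    """
    command = ["globus", "transfer"]
    if recursive:
        # Needed when transferring a whole directory tree.
        command.append("--recursive")
    # Assumed CLI equivalent of the web checkbox described in the workaround.
    command.append("--no-verify-checksum")
    command.append(f"{source_id}:{source_path}")
    command.append(f"{dest_id}:{dest_path}")
    return command


if __name__ == "__main__":
    # Placeholder values: replace with real collection UUIDs and paths.
    cmd = build_globus_transfer_command(
        "SOURCE_COLLECTION_UUID", "/path/to/project_dir/",
        "HPSS_COLLECTION_UUID", "/path/in/hpss/project_dir/",
    )
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

Because verification is switched off, it is worth comparing file counts and sizes on both sides once the transfer completes.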

Thu Jun 25, 2026, 3:39 PM: Due to an unexpected technical issue, the SciNet systems cannot yet be brought up today (except for the login nodes). We are working to have the systems back up tomorrow.

Tue Jun 23, 2026, 1:34 PM: Trillium login nodes and datamover nodes are up again, as is the OnDemand interface. This is mainly to give users access to their files; no jobs can be submitted for at least another day.

Trillium AI Expansion Installation Maintenance Shutdown Schedule

  1. May 26/27, 2026: Shutdown of the Trillium compute nodes and HPSS, starting at 4 AM EDT on May 26th. The Trillium login nodes, OnDemand, the Teach cluster, and Balam will remain available during this maintenance. HPSS will be back in service later on the same day (May 26th), while the Trillium compute nodes are expected to be back in service at some time on May 27th.
  2. June 9/10, 2026: A multi-day chiller maintenance; this will involve a shutdown of all systems.
  3. June 22-25, 2026: A 4-day full datacentre shutdown. The SciNet datacentre will undergo maintenance of several critical parts of the centre. This will require a full shutdown of all SciNet systems (Trillium, Balam, S4H, HPSS, Teach, as well as hosted equipment).
  4. July 13-24, 2026 (tentative): A two-week shutdown of the Trillium compute nodes will be required. During most of this time the Trillium login nodes, storage, and Open OnDemand will remain up, except for a one-day power shutdown within this period (currently planned for July 22nd).

More information and precise dates for the last maintenance shutdown will be announced later.

Wed Apr 30, 2026, 3:00 pm: Systems have been updated to mitigate known security risks and are back in service. Note that no actual security breaches were found.

Wed Apr 29, 2026, 5:25 pm: For security reasons, login access to all systems has been disabled, as have OnDemand Apps. Compute jobs are still allowed to run.

Thu Apr 23, 2026, 10:00 am: The Trillium file system has issues and may be slow on certain nodes. We are still investigating.

Previous messages

QuickStart Guides

Tutorials, Manuals, etc.