system news

Severe problems with the PFS filesystem (updated 2019-06-24 11:35)

  • Posted on: 14 June 2019
  • By: torkel

We are currently experiencing severe problems with the PFS (parallel file system).

Both Kebnekaise and Abisko are affected, including access to the PFS filesystem from the login nodes. Either it takes a very long time to access the files or the files are not available. We recommend that you try to avoid using the PFS filesystem. 

We are working intensively to solve the problems. At the moment we have no ETA for when the problems will be resolved.

Note: The batch queues for Kebnekaise and Abisko are stopped until further notice.

*RESOLVED* pfs file system slow/down, 2019-04-04

  • Posted on: 4 April 2019
  • By: nikke

2019-04-04:

We are experiencing severe slowdown on the /pfs/nobackup file system, affecting all accesses including running jobs.

This is caused by components in the storage system restarting for unknown reasons. Investigation is ongoing.

*UPDATE* In order to identify what is going on we are forced to shut down the file system occasionally. The vendor is assisting in identifying and fixing the issue.

Unplanned cluster issues due to switchboard failure

  • Posted on: 15 March 2019
  • By: zao

2019-03-29:

Normal power routing restored to all nodes.

 

2019-03-28:

Repairs completed, switchboard powered up.

 

2019-03-25:

Due to component delivery delays the final steps of repair are postponed. The new date for completion is Wednesday 2019-03-27.

 

2019-03-20:

Replacement parts and cables are en route. We currently estimate installation and recertification of the switchboard to be finished by the end of Monday 2019-03-25.

 

2019-03-15:

2019-03-13 - Power outage at HPC2N. *Clusters back up*

  • Posted on: 13 March 2019
  • By: roger

There was a 25-minute power outage at the Umeå University campus just before 09:30. This brought down our clusters (Kebnekaise and Abisko) and killed all running jobs. It has also affected the Kebnekaise login nodes.

Power is back now, and we are in the process of bringing the clusters back up. We will add more information when we know what happened and/or when the clusters are back up.

*Update* The reason for the power outage was a severed cable that triggered a circuit breaker in one of the university's internal power stations.

Maintenance window, Monday 2019-03-04 09:00 - 13:00, all clusters affected

  • Posted on: 25 February 2019
  • By: ake

On Monday 2019-03-04 09:00 we have a maintenance window to replace parts of the /pfs/nobackup file system.

We expect it to take a couple of hours.

All clusters will be affected. Jobs whose requested time limit reaches beyond the start of the maintenance window will not be allowed to start until after the service.
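To check whether a job can still start before the window, compare its expected end time (start time plus requested time limit) with the start of the maintenance window. The following is only a minimal sketch of that arithmetic, not the scheduler's actual logic. The function names are hypothetical, and it assumes that a job ending exactly at 09:00 is still allowed to start.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the announcement above.
MAINTENANCE_START = datetime(2019, 3, 4, 9, 0)


def can_start_before_maintenance(now, time_limit, maintenance_start=MAINTENANCE_START):
    """Return True if a job started at `now` would end by `maintenance_start`.

    Hypothetical helper. Assumes a job that ends exactly at the start of
    the window does not count as reaching beyond it.
    """
    return now + time_limit <= maintenance_start


def longest_time_limit(now, maintenance_start=MAINTENANCE_START):
    """Return the longest time limit that still fits before the window.

    Returns a zero timedelta if the window has already started.
    """
    return max(maintenance_start - now, timedelta(0))


if __name__ == "__main__":
    submit_time = datetime(2019, 3, 3, 21, 0)
    print(can_start_before_maintenance(submit_time, timedelta(hours=12)))
    print(can_start_before_maintenance(submit_time, timedelta(hours=13)))
    print(longest_time_limit(submit_time))
```

In this sketch, a job started at 21:00 on Sunday 2019-03-03 with a 12-hour time limit fits before the window, while the same job with a 13-hour limit would have to wait until after the service.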

Lost power to the clusters. *Power and clusters back*

  • Posted on: 30 January 2019
  • By: roger

At around 19:20 today (30/1), the power to the clusters was lost. That means neither Kebnekaise nor Abisko is running any jobs. All running jobs have stopped. We are working on restoring power and bringing the clusters back up. This message will be updated as we know more.

Login to the Kebnekaise access nodes will stall until the Kebnekaise cluster is back up. The Abisko login node should work.

*Update* 2019-01-30 22:40: Abisko should be up and running again. Kebnekaise has some problems with /pfs and is still down. It will probably not be fixed until tomorrow.

Updated: 2024-11-01, 13:56