System news

/pfs/nobackup filesystem now back in production

  • Posted on: 28 April 2016
  • By: admin

The filesystem is now back in production.

As far as we can tell no files or directories were lost, but if you do find evidence of that, please notify support@hpc2n.umu.se so we can report it to the vendor.

We're sorry about the long downtime, but we were verifying each step with the vendor to do our best not to lose any data.

Mon, 2015-08-24 11:32 | Åke Sandgren

Electrical maintenance causing problems

  • Posted on: 28 April 2016
  • By: admin

The electrical maintenance that was scheduled for 19:00 on Wednesday evening did not go without problems.

Something went slightly wrong, causing one of the UPSes to fail. This in turn caused the cooling system to fail, leading to a rapid rise in temperature and a subsequent emergency power cut.

This caused so many problems that we will not be able to get things back online until tomorrow (Thursday).

We lost, among other things, the /pfs/nobackup filesystem, which is the reason that the queues have been stopped. We expect that jobs have failed.

Scheduled electrical maintenance, Wed Aug 19 09:00-14:00

  • Posted on: 28 April 2016
  • By: admin

On Wednesday August 19 09:00-14:00 all compute clusters will be unavailable due to scheduled maintenance of the high-voltage electrical infrastructure.

No compute jobs extending into the downtime window will start; consider scheduling shorter jobs if possible to maximize utilization of the systems before the downtime.
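
As a rough illustration, the sketch below (in Python, not an official HPC2N tool; the scheduler's own decision is what actually counts) checks whether a given walltime request could still finish before the window opens:

    from datetime import datetime, timedelta

    # Downtime start taken from this announcement; times treated as local.
    DOWNTIME_START = datetime(2015, 8, 19, 9, 0)

    def fits_before_downtime(walltime_hours, now=None):
        """Return True if a job started now would finish before the downtime begins."""
        now = now or datetime.now()
        return now + timedelta(hours=walltime_hours) <= DOWNTIME_START

    # Example: checked on the Monday evening before the maintenance.
    monday_evening = datetime(2015, 8, 17, 20, 0)
    print(fits_before_downtime(12, now=monday_evening))   # True: finishes Tue 08:00
    print(fits_before_downtime(48, now=monday_evening))   # False: would run into the window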

Tue, 2015-08-11 09:33 | Niklas Edmundsson

PFS down (resolved)

  • Posted on: 28 April 2016
  • By: admin

As is tradition lately, PFS is again inaccessible and attempts to use it will hang.
Batch queues have been suspended and we're poking the storage system.

Update 2015-08-08 21:39 CEST:
The filesystem is back online and the queues have been resumed.

Fri, 2015-08-07 15:05 | Lars Viklund

PFS inaccessible

  • Posted on: 28 April 2016
  • By: admin

We are again having problems with the PFS system. File access is slow or hanging. We are investigating, and more information will follow as soon as we have it.

The queues have been paused and will be resumed when access to pfs has been restored.

Update 12:26: The PFS system is back online and all queues have been resumed. Sorry about any inconvenience the problem caused. If you notice further problems with the PFS, please report them to support@hpc2n.umu.se.

Problem with PFS

  • Posted on: 28 April 2016
  • By: admin

We are currently experiencing problems with the PFS system. File access is slow or hanging. We are looking for the cause, and more information will follow as soon as we have it.

The queues have been paused and will be resumed when access to pfs has been restored.

Update 20:44: The PFS system is back online and all queues have been resumed. We apologize for any inconvenience. If you notice any further problems with the PFS file system, please report them to support@hpc2n.umu.se.

Driver issues with the Abisko interconnect

  • Posted on: 28 April 2016
  • By: admin

We are experiencing driver issues with the InfiniBand interconnect. This might lead to jobs failing with errors like "Invalid resource type" and "Invalid CQ event".

We are in the process of updating the drivers, which we hope will solve the issues. Since we are unwilling to abort running jobs, we cannot say exactly when the update will be fully finished, but we expect it to be done before the end of the week.
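
If you want to check whether any of your jobs were affected, one way is to search your job output files for those error strings. The sketch below is only an illustration, not an HPC2N-provided tool; it assumes Python and the default Slurm output file names:

    import glob

    # Error strings quoted in the announcement above.
    ERRORS = ("Invalid resource type", "Invalid CQ event")

    # Default Slurm output naming (slurm-<jobid>.out) is assumed here;
    # adjust the pattern if your job scripts redirect output elsewhere.
    for path in glob.glob("slurm-*.out"):
        with open(path, errors="replace") as fh:
            text = fh.read()
        hits = [err for err in ERRORS if err in text]
        if hits:
            print(path + ": " + ", ".join(hits))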

Mon, 2015-07-06 09:38 | Roger Oskarsson

The /pfs/nobackup problems from 2015-05-27 are currently solved

  • Posted on: 28 April 2016
  • By: admin

The problems we had with the /pfs/nobackup filesystem are currently solved.

We are still waiting for the vendor to tell us exactly what happened and how to make sure it doesn't happen again.

But for the time being things are expected to be back to normal.

Some jobs may have failed due to the timeouts that resulted from the problem, but as far as we can tell at the moment no files have been lost.

We apologize for this and will try to minimize the risk of something like this happening again.
