Some systems are experiencing issues
Maintenance
Matrix maintenance

The Matrix databases are undergoing regular maintenance. All Matrix services may experience degraded performance or brief outages.

Hornet: PHP 8.1 removal

PHP 8.1 will be removed from Hornet. Please migrate to PHP 8.3 as soon as possible.

Past Incidents

2019-09-09

GitLab: GitLab security update

GitLab has released a security update. Due to its severity, SNT's instance will be upgraded immediately.

2019-08-30

Some services remained down after a network disruption.

At 12:50 CEST there was a short but major network disruption. Most services continued to function after the disruption; however, some remained down, among them the cloud management web interface and cPanel.

At 15:05 CEST all issues were resolved.

2019-08-29

Many SNT-hosted services are down.

About two thirds of all VMs hosted in the SNT cloud are currently down. Among them are many of SNT's services as well as some VMs belonging to vcolo customers.

Update 2019-08-29 18:21 CEST: all services should be operational

2019-08-14

GitLab: GitLab security update

GitLab has released a security update. Due to its severity, SNT's instance will be upgraded immediately.

2019-01-10

GitLab: GitLab upgrade issues

Some components, like pipelines (CI/CD), do not work after the upgrade. We're looking into it.

Update 2019-01-11 11:32: The outage was caused by a failed database migration, which in turn was caused by the upgraded Redis daemon listening on a different UNIX socket. The socket issue was fixed quickly after the update to GitLab 11.5, but the failed migration went unnoticed due to the large amount of output produced by the upgrade.
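
For future upgrades, a pre-migration check along the following lines could catch a moved Redis socket before any database migration runs. This is only a sketch: the socket path is a placeholder and not the path used by SNT's GitLab instance.

#!/usr/bin/env python3
"""Minimal pre-migration check: verify that Redis answers on the expected
UNIX socket. The socket path below is a placeholder, not the path used by
SNT's GitLab instance."""
import socket
import sys

EXPECTED_SOCKET = "/var/run/redis/redis.sock"  # placeholder path

def redis_ping(path: str, timeout: float = 2.0) -> bool:
    """Send an inline PING over the UNIX socket and expect +PONG back."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(b"PING\r\n")
            return s.recv(64).startswith(b"+PONG")
    except OSError:
        return False

if __name__ == "__main__":
    if redis_ping(EXPECTED_SOCKET):
        print(f"Redis is listening on {EXPECTED_SOCKET}")
    else:
        print(f"no Redis on {EXPECTED_SOCKET}; check where the daemon's socket moved to")
        sys.exit(1)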

2019-01-04

GitLab runners: CI outage

Because of unforeseen problems encountered while migrating some internal systems, the GitLab Runners are currently offline. We are working to resolve this as soon as possible.

Update 23:10: the incident has been resolved.

2018-11-08

Cloud: Cloud platform disk failure

A disk failure occurred on the cloud platform. The disk has been replaced.

2018-10-31

Cloud: Virtualization platform outage

There was an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks. WESP (our old web, email and storage platform) was down during this disruption. HORNET (our new web, email and storage platform) was unaffected.

Post mortem

The outage started on 2018-10-31 15:40 CET during an upgrade of the storage pools underlying the virtualization platform (oVirt). oVirt stores its data on Gluster 'bricks', which are backed by an XFS filesystem on LVM. The HDD storage was extended by adding 12 TB disks to all nodes of the cluster and extending the logical volume and filesystem underlying the HDD storage brick.
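
For illustration, growing one HDD brick comes down to steps of roughly the following shape. This is a sketch only: the device, volume group, logical volume and mount point names are placeholders rather than SNT's actual configuration.

#!/usr/bin/env python3
"""Sketch of the per-node brick extension: register the new disk with LVM,
grow the logical volume backing the HDD brick, then grow the XFS filesystem
on top of it. All names below are placeholders."""
import subprocess

NEW_DISK = "/dev/sdX"                 # placeholder: the newly added 12 TB disk
VG = "vg_gluster"                     # placeholder volume group
LV = "/dev/vg_gluster/hdd_brick"      # placeholder logical volume backing the brick
MOUNT = "/gluster/hdd_brick"          # placeholder mount point of the XFS brick

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["pvcreate", NEW_DISK])               # make the disk an LVM physical volume
run(["vgextend", VG, NEW_DISK])           # add it to the volume group
run(["lvextend", "-l", "+100%FREE", LV])  # grow the LV over the new space
run(["xfs_growfs", MOUNT])                # grow XFS online; note it can never be shrunk again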

The last node crashed during this extension and its filesystem had to be re-created. Not all virtual machines could be moved away before the machine was 'fenced' from the rest of the cluster. These virtual machines remained down while recovery was in progress because of the 'unknown' state of the HDD storage domain, which oVirt had designated as the master domain. After some time (see the timeline) the master domain was forced onto the (unaffected) SSD storage, allowing all downed virtual machines without HDD storage to resume operation.

While rebuilding, oVirt threw a large number of relatively vague errors: spmstatusvds failed (22, sanlock resource read failure, invalid argument). These turned out to be caused by the new 12 TB disks being 4K-native, i.e. exposing only 4096-byte sectors.
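
Such a mismatch can be spotted from sysfs before LVM or the filesystem is touched. A minimal check, with an example device name:

#!/usr/bin/env python3
"""Print the logical and physical sector sizes of a block device via sysfs.
A 4K-native (4Kn) disk reports 4096/4096, a 512e disk 512/4096 and a classic
disk 512/512. The default device name is only an example."""
import sys
from pathlib import Path

def sector_sizes(dev: str):
    queue = Path("/sys/block") / dev / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
    logical, physical = sector_sizes(dev)
    print(f"{dev}: logical={logical} physical={physical}")
    if logical == 4096:
        print("4K-native disk: verify that every consumer (XFS, sanlock/vdsm) supports it")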

Because shrinking XFS filesystems is not possible, the filesystem was destroyed, after which Gluster could 'heal' (rebuild) the files back to the correct state. This was expected to take anywhere from a few hours up to a day.

Gluster, however, could not reach a healed state because some data had already been (unsuccessfully) written to the new disks. Moving the filesystem to (borrowed) 512-byte sector drives resolved the issue on 2018-11-03 12:20 CET.
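
One way to keep an eye on heal progress is to sum the pending entry counts that gluster volume heal <volume> info reports per brick. A small polling sketch with a placeholder volume name; this is not necessarily how progress was tracked during the incident:

#!/usr/bin/env python3
"""Poll 'gluster volume heal <volume> info' and report how many entries still
need healing across all bricks. The volume name is a placeholder."""
import re
import subprocess
import time

VOLUME = "hdd"  # placeholder Gluster volume name

def pending_heal_entries(volume: str) -> int:
    out = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each brick section ends with a line like "Number of entries: 42"
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", out))

if __name__ == "__main__":
    while True:
        remaining = pending_heal_entries(VOLUME)
        print(f"entries still to heal: {remaining}")
        if remaining == 0:
            break
        time.sleep(60)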


Causes and contributing factors:

  • The XFS filesystem on the last node (krentenwegge) was created with 512-byte sectors. Extending it onto a disk with 4K sectors results in a crash.

  • oVirt (sanlock/vdsm) only supports disks with 512-byte sectors.

  • oVirt designated the HDD storage pool as the 'master' storage domain, which must be functional before the datacenter can start. There is no straightforward way to override this manually.

  • XFS does not support shrinking.

  • XFS spreads data across the entire filesystem, so newly added space starts receiving data almost immediately.

  • Errors in oVirt are relatively easy to miss.


Learning points:

  • Be careful with 4K-native disks. They may be supported, but this is never guaranteed; check every action twice when such disks are involved.

  • Be conservative when enlarging a filesystem that cannot be shrunk.

  • Actively monitor the platform while performing even routine maintenance tasks.


Approximate timeline:

  • 2018-10-31 14:30: Start of upgrade of nodes.

  • 2018-10-31 15:39: Last node (krentewegge) fails during the resize of the logical volume.

  • 2018-10-31 15:40: Two virtual machines using HDD space on krentewegge (WESP and Kronos) are paused.

  • 2018-10-31 16:01: Virtual machines are migrated away from krentewegge.

  • 2018-10-31 16:02: Krentewegge is fenced from the rest of the cluster. Datacenter is in unknown state.

  • 2018-10-31 16:10: Krentewegge is accessed via IPMI and recovery starts.

  • 2018-10-31 17:15: Heal of entire HDD volume starts, ETA 2 hours.

  • 2018-10-31 18:58: Datacenter is still in unknown state due to various errors: spmstatusvds failed (22, sanlock resource read failure, invalid argument)

  • 2018-10-31 20:33: Possible cause found to be 4K native sectors: https://bugzilla.redhat.com/show_bug.cgi?id=1386443

  • 2018-10-31 21:01: Recovery plan is to delete the logical volumes and recreate them on the old (512 sector) drives only, as filesystem shrinking is not possible with XFS. This should be done on all nodes, one by one. This is started on krentewegge. ETA: next morning.

  • 2018-11-01 07:29: First heal (of 3) at only 1.9 / 2.5 TiB.

  • 2018-11-01 11:07: First heal (of 3) at only 2.4 / 2.5 TiB and slowing progress.

  • 2018-11-01 12:07: Datacenter is forced online by setting the SSD storage domain to master via internal commands. All VMs except for WESP and Kronos recover.

  • 2018-11-01 15:55: HDD images for stuck VMs are being backed up to the backup server.

  • 2018-11-01 19:00: 512e disks are ordered to replace the 4Kn disks with pvmove (see the sketch after this timeline).

  • 2018-11-01 22:00: Heal does not seem to complete because some data made it onto the 4Kn drives on mergpijp. The VMs don't start for the same reason.

  • 2018-11-02 01:25: New disks may not arrive in a timely fashion. Plea for help is sent to LISA.

  • 2018-11-02 09:25: Offer to borrow 6x 4TB disks is made.

  • 2018-11-02 09:45: Disks are received.

  • 2018-11-02 11:00: All borrowed disks have been placed and configured. Migration started.

  • 2018-11-02 12:16: Migration is done. VMs are being started.

  • 2018-11-03 12:20: Outage is resolved.
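
The pvmove-based replacement mentioned in the timeline boils down to migrating all extents off the affected physical volume and dropping it from the volume group. A sketch with placeholder device and volume group names, not the actual SNT configuration:

#!/usr/bin/env python3
"""Sketch of replacing a 4Kn disk with a 512-byte-sector disk inside an
existing volume group using pvmove. All names below are placeholders."""
import subprocess

VG = "vg_gluster"      # placeholder volume group
OLD_PV = "/dev/sdX"    # placeholder: 4Kn disk to be evacuated
NEW_PV = "/dev/sdY"    # placeholder: replacement 512-byte-sector disk

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["pvcreate", NEW_PV])        # prepare the replacement disk
run(["vgextend", VG, NEW_PV])    # add it to the volume group
run(["pvmove", OLD_PV, NEW_PV])  # move all extents off the old disk
run(["vgreduce", VG, OLD_PV])    # remove the old disk from the VG
run(["pvremove", OLD_PV])        # wipe its LVM label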


Outage

There is an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks.

Updates:

  • 2018-11-03 12:20 Outage is resolved.

  • 2018-11-02 12:16 Migration is done. VMs are being started.

  • 2018-11-02 10:46 Migration is almost done (95%) and estimated to finish in the next hour.

  • 2018-11-02 19:55 Migration is at 35%. Updated estimate: 14 hours remaining

  • 2018-11-02 15:05 Migration of data is still in progress. Rough estimate: 20 hours remaining.

  • 2018-11-02 11:00 Migration of data to new disks is in progress.

  • 2018-11-02 09:50 We've borrowed some disks from the University's IT department. They will be used to replace the disks causing the outage.

WESP outage

The issue affecting the cloud platform has caused a major outage for WESP.