Virtualization platform outage

There was an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks. WESP (our old web, email and storage platform) was down during this disruption. HORNET (our new web, email and storage platform) was unaffected.

Post mortem

The outage started on 2018-10-31 at 15:40 CET during an upgrade of the storage pools underlying the virtualization platform (oVirt). oVirt saves its data on Gluster 'bricks', which are backed by an XFS filesystem on LVM. The HDD storage was extended by adding 12 TB disks to all nodes of the cluster, and then extending the logical volume and filesystem underlying the HDD storage brick.

The last node crashed during this extension and its filesystem had to be re-created. Not all virtual machines could be moved away before the machine was 'fenced' from the rest of the cluster. These virtual machines remained down while recovery was in progress because of the 'unknown' state of the HDD storage domain, which was designated as the master domain by oVirt. After some time (see timeline) the master domain was forced to the (unaffected) SSD storage, allowing all downed virtual machines without HDD storage to resume operations.

While rebuilding, oVirt threw a large number of relatively vague errors: spmstatusvds failed (22, sanlock resource read failure, invalid argument). These were found to be caused by the fact that the new 12 TB disks had 4K native sectors only.
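
As an illustration of how the sector format of a disk can be identified up front, the sketch below reads the logical and physical sector sizes that Linux exposes under /sys/block/<device>/queue/ and names the format. The function names and the sysfs_root parameter are hypothetical helpers for this example and were not part of the tooling used during the outage.

```python
from pathlib import Path
from typing import Tuple


def classify_sectors(logical: int, physical: int) -> str:
    """Name the sector format from logical/physical sector sizes in bytes."""
    if logical == 512 and physical == 512:
        return "512n"  # native 512 byte sectors
    if logical == 512 and physical == 4096:
        return "512e"  # 4K physical sectors with 512 byte emulation
    if logical == 4096 and physical == 4096:
        return "4Kn"   # 4K native, the format of the new 12 TB disks
    return "unknown"


def read_sector_sizes(device: str, sysfs_root: str = "/sys/block") -> Tuple[int, int]:
    """Read (logical, physical) sector sizes for a block device such as 'sda'.

    sysfs_root is only a parameter so the function can be tested against
    a fake directory tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical
```

Calling classify_sectors(*read_sector_sizes("sda")) on a node would report the format of that disk; "sda" is just an example device name.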

Because XFS filesystems cannot be shrunk, the filesystem was destroyed, after which Gluster could 'heal' (rebuild) the files to the correct state. This was expected to take between a few hours and a day.

Gluster, however, could not reach a healed state because some data had already been written (or had failed to be written) to the new disks. Moving the filesystem to (borrowed) 512 byte sector drives resolved the issue on 2018-11-03 at 12:20 CET.

Causes and contributing factors:

The XFS filesystem on the last node (krentenwegge) was created with 512 byte sectors. Extending this to a disk with 4K sectors results in a crash.
oVirt (sanlock/vdsm) only supports disks with 512 byte sectors.
oVirt designated the HDD storage pool as 'master', so this pool had to be functional before the datacenter could start. It is not possible to manually override this in a straightforward fashion.
XFS does not support shrinking.
XFS spreads data over its entire filesystem, so data was written to the new disks soon after the extension.
Errors in oVirt are relatively easy to miss.

Learning points:

Be careful with 4K native disks. These may be supported, but this is never guaranteed. Check every action twice with these disks.
Be conservative when enlarging a filesystem that cannot be shrunk.
Actively monitor the platform while performing even routine maintenance tasks.
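
One way to act on the first two learning points is a pre-flight check that refuses to extend storage when a new disk's logical sector size differs from the disks already in use. The following is a minimal sketch under that assumption; the function name and the device names in the usage example are hypothetical, and the sector sizes are supplied by the caller rather than read from hardware.

```python
from typing import Dict, List


def extension_problems(existing: Dict[str, int], candidates: Dict[str, int]) -> List[str]:
    """Return reasons not to add the candidate disks to the existing storage.

    Both arguments map a device name to its logical sector size in bytes.
    An empty list means no sector size mismatch was found.
    """
    problems = []
    in_use = sorted(set(existing.values()))
    if len(in_use) > 1:
        problems.append(f"existing disks already mix logical sector sizes: {in_use}")
    for name, size in sorted(candidates.items()):
        if size not in in_use:
            problems.append(
                f"{name}: logical sector size {size} differs from existing {in_use}"
            )
    return problems


# Usage example with made-up device names: two 512 byte disks in use,
# one 4K native disk about to be added.
issues = extension_problems({"sda": 512, "sdb": 512}, {"sdc": 4096})
for issue in issues:
    print("refusing to extend:", issue)
```

An empty result only says that the sector sizes match; it does not prove that the rest of the stack supports the new disks.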

Approximate timeline:

2018-10-31 14:30: Start of upgrade of nodes.
2018-10-31 15:39: Last node (krentewegge) fails during the resize of the logical volume.
2018-10-31 15:40: Two virtual machines using HDD space on krentewegge (WESP and Kronos) are paused.
2018-10-31 16:01: Virtual machines are migrated away from krentewegge.
2018-10-31 16:02: Krentewegge is fenced from the rest of the cluster. Datacenter is in unknown state.
2018-10-31 16:10: Krentewegge is accessed via IPMI and recovery starts.
2018-10-31 17:15: Heal of entire HDD volume starts, ETA 2 hours.
2018-10-31 18:58: Datacenter is still in unknown state due to various errors: spmstatusvds failed (22, sanlock resource read failure, invalid argument)
2018-10-31 20:33: Possible cause found to be 4K native sectors: https://bugzilla.redhat.com/show_bug.cgi?id=1386443
2018-10-31 21:01: Recovery plan is to delete the logical volumes and recreate them on the old (512 byte sector) drives only, as filesystem shrinking is not possible with XFS. This should be done on all nodes, one by one. This is started on krentewegge. ETA: next morning.
2018-11-01 07:29: First heal (of 3) at only 1.9 / 2.5 TiB.
2018-11-01 11:07: First heal (of 3) at only 2.4 / 2.5 TiB and slowing progress.
2018-11-01 12:07: Datacenter is forced online by setting the SSD storage domain to master via internal commands. All VMs except for WESP and Kronos recover.
2018-11-01 15:55: HDD images for stuck VMs are being backed up to the backup server.
2018-11-01 19:00: 512e disks are ordered to replace the 4Kn disks with pvmove.
2018-11-01 22:00: Heal does not seem to complete because some data made it onto the 4Kn drives on mergpijp. The VMs don't start for the same reason.
2018-11-02 01:25: New disks may not arrive in a timely fashion. Plea for help is sent to LISA.
2018-11-02 09:25: Offer to borrow 6x 4TB disks is made.
2018-11-02 09:45: Disks are received.
2018-11-02 11:00: All borrowed disks have been placed and configured. Migration started.
2018-11-03 12:16: Migration is done. VMs are being started.
2018-11-03 12:20: Outage is resolved.
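
The two heal progress entries on 2018-11-01 (1.9 TiB at 07:29 and 2.4 TiB at 11:07, out of 2.5 TiB) allow a rough estimate by linear extrapolation. The sketch below is only an illustration of that arithmetic; linear_eta is a hypothetical helper, and because progress was slowing, the result is an optimistic bound (the heal did not in fact complete).

```python
from datetime import datetime, timedelta


def linear_eta(t0: datetime, done0: float, t1: datetime, done1: float, total: float) -> datetime:
    """Estimate completion time assuming the rate between two samples stays constant."""
    if done1 <= done0:
        raise ValueError("no progress between samples; cannot extrapolate")
    rate = (done1 - done0) / (t1 - t0).total_seconds()  # units per second
    return t1 + timedelta(seconds=(total - done1) / rate)


# Samples taken from the timeline above (sizes in TiB).
eta = linear_eta(
    datetime(2018, 11, 1, 7, 29), 1.9,
    datetime(2018, 11, 1, 11, 7), 2.4,
    total=2.5,
)
print(eta.strftime("%Y-%m-%d %H:%M"))
```

With these samples the remaining 0.1 TiB would take about 44 minutes at the average rate of the preceding 3 hours and 38 minutes.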

Outage

There is an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks.

Updates:

2018-11-03 12:20 Outage is resolved.
2018-11-03 12:16 Migration is done. VMs are being started.
2018-11-03 10:46 Migration is almost done (95%) and estimated to finish in the next hour.
2018-11-02 19:55 Migration is at 35%. Updated estimate: 14 hours remaining.
2018-11-02 15:05 Migration of data is still in progress. Rough estimate: 20 hours remaining.
2018-11-02 11:00 Migration of data to new disks is in progress.
2018-11-02 09:50 We've borrowed some disks from the University's IT department. They will be used to replace the disks causing the outage.

Virtualization platform outage 2018-10-31 14:40:00 (Europe/Amsterdam)
