Virtualization platform outage 2018-10-31 14:40:00 (Europe/Amsterdam)


Virtualization platform outage

There was an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks. WESP (our old web, email and storage platform) was down during this disruption. HORNET (our new web, email and storage platform) was unaffected.

Post mortem

The outage started on 2018-10-31 15:40 CET during an upgrade of the storage pools underlying the virtualization platform (oVirt). oVirt saves its data on Gluster 'bricks', which are backed by an XFS filesystem on LVM. The HDD storage was extended by adding 12 TB disks to all nodes of the cluster and then extending the logical volume and filesystem underlying the HDD storage brick.

The last node crashed during this extension and the filesystem had to be re-created. Not all virtual machines could be moved away before the machine was 'fenced' from the rest of the cluster. These virtual machines remained down while recovery was in progress because of the 'unknown' state of the HDD storage domain, which oVirt had designated as the master domain. After some time (see timeline) the master domain was forced to the (unaffected) SSD storage, allowing all downed virtual machines without HDD storage to resume operations.

During the rebuild, oVirt threw a large number of relatively vague errors: spmstatusvds failed (22, sanlock resource read failure, invalid argument). These were found to be caused by the new 12 TB disks having 4K native sectors only.
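
The sector format of a disk can be inspected before it is added to a storage pool. The following is a minimal sketch, not the procedure used during this incident: it assumes a Linux host, where the kernel exposes the logical and physical sector sizes under /sys/block/<device>/queue/. The function names and the 512n/512e/4Kn labels are illustrative choices.

```python
from pathlib import Path


def read_block_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes for a block device.

    Reads the Linux sysfs attributes logical_block_size and
    physical_block_size; sysfs_root is a parameter only so the function
    can be exercised against a fake directory tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical


def classify_sector_format(logical, physical):
    """Name the sector format implied by the two sizes."""
    if logical == 512 and physical == 512:
        return "512n"  # native 512-byte sectors
    if logical == 512 and physical == 4096:
        return "512e"  # 4K physical sectors with 512-byte emulation
    if logical == 4096 and physical == 4096:
        return "4Kn"  # 4K native sectors only, no 512-byte emulation
    return "unknown"


def mixed_logical_sizes(devices, sysfs_root="/sys/block"):
    """Return {device: logical size} if the devices disagree, else {}."""
    sizes = {dev: read_block_sizes(dev, sysfs_root)[0] for dev in devices}
    return sizes if len(set(sizes.values())) > 1 else {}
```

Calling classify_sector_format(*read_block_sizes("sdb")) on a host (the device name "sdb" is hypothetical) would report "4Kn" for disks like the ones described above, and a non-empty result from mixed_logical_sizes would flag that existing and new disks use different logical sector sizes.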

Because XFS filesystems cannot be shrunk, the filesystem was destroyed, after which Gluster could 'heal' (rebuild) the files to the correct state. This was expected to take from a few hours up to a day.

Gluster, however, could not reach a healed state because some data had already been written (or had failed to be written) to the new disks. Moving the filesystem to (borrowed) drives with 512-byte sectors resolved the issue on 2018-11-03 12:20 CET.


Causes and contributing factors:


Learning points:


Approximate timeline:


Outage

There is an outage affecting some services on the cloud platform, caused by an incompatibility between oVirt and newly acquired 12 TB disks.

Updates: