Partial outage of vColo platform 2022-08-13 19:59:00 (Europe/Amsterdam)

Performance issues led to a large outage and data loss for a number of virtual machines. The core issues have been resolved, but some services will have to be restored from backup. The status page will be updated as services are restored.

See Cloud outage of August 2022 for details.

The steps we took yesterday have resolved the issue. For more detail we have written an overview here.

Note that some SNT services will have to be restored from backup. This includes DAS, Jitsi and the password reset portal for the vColo web interface. For a complete overview, see the current outages on the dashboard. We will restore those in the next few weeks.

Please contact us if you need any help.

Posted 3 years ago by silke

All failing SSDs have been replaced. Additionally, we have identified an issue in the configuration of the hypervisors that may have caused the I/O issues.

Unfortunately, the problems returned in the last few days. The most recent flare-up has caused additional issues, for which the following still applies:

Not all customer data has survived: due to compounding failures in our VM platform, SSD data on some vColos (including our own) has been lost. Affected customers will be contacted and have been given new VMs with any recovered data mounted. Contact us for unrepaired disk images. We recommend activating your Disaster Recovery Plan using these new VMs.

Note that we can only identify data loss on the VM platform itself. The issues may have caused corruption inside the VMs (either directly or as a side effect), which we are unable to detect. Contact us at cloud@snt.utwente.nl if you want a new disk for recovery, or if you want a snapshot before recovery is attempted.

We will monitor the platform to see whether all issues have been resolved, and will start restoring all affected services tomorrow if possible.

Posted 3 years ago by silke

Most of the severe performance issues have been resolved. However, they may flare up again. At this time we suspect that the issues are caused by (somewhat) failing SSDs, which are being replaced.

Unfortunately, not all customer data has survived: due to compounding failures in our VM platform, SSD data on some vColos (including our own) has been lost. Affected customers will be contacted and have been given new VMs with any recovered data mounted. Contact us for unrepaired disk images. We recommend activating your Disaster Recovery Plan using these new VMs.

Posted 3 years ago by silke

The issues with our cloud platform are still unresolved. Troubleshooting will continue tomorrow.

Posted 3 years ago by silke

The issues with our cloud platform are still unresolved. Troubleshooting will continue tomorrow.

Posted 3 years ago by silke

We are continuing to manage the cloud platform outage. Many services should be available, though with degraded performance. We will update this incident as the situation develops.

Posted 3 years ago by silke