Performance degradation with Block Storage

Incident Report for Tilaa

Postmortem

On Friday, we expanded the available disk space on our SSD CEPH cluster in Haarlem. We have done this many times before without any customer impact, so we did not schedule a maintenance window for it.

Unfortunately, this time the expansion did lead to customer impact. Adding the new disks to the cluster led to stuck placement groups, which in practice means hanging disk operations. During this time, virtual machines experienced issues when reading from or writing to their Big Disk (network-attached block storage).

We detected the issue immediately and started working to address it. We restarted all nodes that had stuck placement groups, which resolved the issue within two hours.

The addition of disk space seems to have led to hanging disk operations because of two factors together: adding many resources at once, and a bug in the software. We have changed our procedures to add resources in smaller batches to prevent future outages.

Of course, all of our hardware is fully redundant; this particular CEPH cluster has its data distributed over 170 disks across 7 nodes. All nodes have redundant power sources and network uplinks. Unfortunately, hardware redundancy does not protect against software bugs such as the one we ran into.

We apologize for any inconvenience this issue has caused, and we are confident that it will not recur.

Posted May 25, 2023 - 10:13 CEST

Resolved

We have monitored the situation and have not seen the problem return.
The disk space expansion is complete. The system is now working through the backlog, after which it will be healthy again.

Posted May 12, 2023 - 18:05 CEST

Update

We are continuing to monitor for any further issues.

Posted May 12, 2023 - 12:53 CEST

Update

The expansion of our CEPH cluster inadvertently led to hanging operations on many of our CEPH nodes. We had to stop these operations manually, which took longer than we had hoped. We have now concluded our recovery steps, and we will keep monitoring the system for another few hours.

Posted May 12, 2023 - 12:52 CEST

Monitoring

We are addressing the issue that we found, and machines are slowly but surely coming back up. We will remain vigilant and keep you up to date through this page.

Posted May 12, 2023 - 12:09 CEST

Update

We have found that only our SSD cluster is affected. The issue seems to be related to the expansion of the disk space in that cluster. We have been working on adding storage capacity for the last few days.

Posted May 12, 2023 - 10:25 CEST

Investigating

Some customers have reported issues with their network-attached block storage (Big Disk), specifically in Haarlem. We are currently investigating the issue and will update this page when we learn more.

Posted May 12, 2023 - 10:12 CEST

This incident affected: Data Center (Haarlem).