On Friday, we expanded the available disk space on our SSD CEPH cluster in Haarlem. We've done this many times before without any customer impact, so we didn't schedule a maintenance window for it.
Unfortunately, this time the expansion did lead to customer impact. Adding the new disks to the cluster led to stuck placement groups, which are essentially hanging disk operations. During this time, virtual machines experienced issues when reading from or writing to their Big Disk (network-attached block storage).
We detected the issue immediately and started working to address it. We restarted all nodes that had stuck placement groups, which resolved the issue within two hours.
The hanging disk operations seem to have been caused by a combination of adding many resources at once and a bug in the software. We have changed our procedures to add resources in smaller chunks to prevent future outages.
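As a rough illustration of what adding resources in smaller chunks can look like, here is a minimal Python sketch. It does not describe the actual tooling used for this cluster: the `add_disk` and `wait_until_healthy` callbacks, the disk names, and the batch size are all hypothetical placeholders for whatever commands and health checks an operator would really use.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def add_in_batches(
    disks: Sequence[str],
    batch_size: int,
    add_disk: Callable[[str], None],         # hypothetical: adds one disk to the cluster
    wait_until_healthy: Callable[[], None],  # hypothetical: blocks until the cluster has settled
) -> int:
    """Add disks one batch at a time, pausing for the cluster after each batch.

    Returns the number of batches processed.
    """
    processed = 0
    for batch in batches(disks, batch_size):
        for disk in batch:
            add_disk(disk)
        # Let the cluster finish rebalancing before adding more capacity.
        wait_until_healthy()
        processed += 1
    return processed
```

Passing the two operations in as callbacks keeps the batching logic independent of any particular storage software, so it can be tried out with stand-in functions before being wired to real commands.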
Of course, all of our hardware is fully redundant: this particular CEPH cluster has its data spread over 170 disks across 7 nodes. All nodes have redundant power sources and network uplinks. Unfortunately, hardware redundancy doesn't protect against software bugs such as the one we ran into.
We apologize for any inconvenience this issue has caused, and we are confident this will not reoccur.