Big Disk slow in Amsterdam

Incident Report for Tilaa

Postmortem

Starting at 11:35 CEST, the performance of Big Disk in Amsterdam was severely degraded.

This was caused by a disk failing unexpectedly, which led the entire Amsterdam SSD Big Disk cluster to re-calibrate in an unexpected way.

As the issue was caused by a bug in the storage platform (Ceph), we were unable to provide an estimated timeline for recovery.

Most VPSs were completely unable to connect to their Big Disk, which also caused issues during boot.

Luckily, there was no risk of data loss.

We worked with subject matter experts to resolve the issue as quickly as possible.

As of 17:45 CEST, limited traffic between a VPS and a Big Disk was possible again.

As of 19:00 CEST, performance was considered fully restored.

Posted Oct 17, 2024 - 09:08 CEST

Resolved

Performance has been restored.
We will continue to monitor the status of the Amsterdam SSD Big Disk cluster and are discussing actions to prevent this from happening again.
Please contact our Support team if you still experience issues.

Posted Oct 17, 2024 - 09:08 CEST

Monitoring

Performance has been restored.
We are actively monitoring the status of the Amsterdam SSD Big Disk cluster.
Please contact our Support team if you still experience issues.

Posted Oct 16, 2024 - 19:13 CEST

Update

Although performance is still degraded, it is now possible to have limited traffic between a VPS and a Big Disk.
Performance is steadily increasing.
We are still actively working to restore full performance.
As soon as more information is available, or by 20:00 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 18:32 CEST

Update

We are continuing to work on a fix for this issue.

Posted Oct 16, 2024 - 18:11 CEST

Update

Although performance is still degraded, it is now possible to have limited traffic between a VPS and a Big Disk.
Rebooting a VPS that is attached to a Big Disk can now be performed again with only minor delays.
We are still actively working to restore full performance.
As soon as more information is available, or by 18:30 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 18:04 CEST

Update

Unfortunately, the performance of Big Disk in Amsterdam is still severely degraded.
No new updates are available yet.
We are continuing to work with subject matter experts to solve the problem as fast as possible.
As soon as more information is available, or by 18:00 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 17:29 CEST

Update

Unfortunately, the performance of Big Disk in Amsterdam is still severely degraded.
A detailed timeline is still not available.
We are working with subject matter experts to help restore the cluster as fast as possible.

As soon as more information is available, or by 17:30 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 17:00 CEST

Update

Unfortunately, no new update is available yet.
We are still actively working to restore the cluster to a usable condition as fast as possible.
As soon as more information is available, or by 17:00 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 16:34 CEST

Update

Unfortunately, the performance of Big Disk in Amsterdam is still severely degraded.
No detailed timeline is available yet.
We are working to provide you with a timeline as fast as possible.
As soon as more information is available, or by 16:30 CEST at the latest, we will provide another update.

Posted Oct 16, 2024 - 16:05 CEST

Update

Unfortunately, the performance of Big Disk in Amsterdam is still severely degraded.
This is caused by a disk failing unexpectedly, which led the entire cluster to re-calibrate in an unexpected way.
Rebooting a VPS that is attached to a Big Disk can cause issues during boot, as the VPS will try to connect to the impacted cluster.

We are actively working to restore the cluster as fast as possible.
Thank you for your patience.
As soon as more information is available, or by 16:00 CEST at the latest, we will send another update.

Posted Oct 16, 2024 - 15:25 CEST

Update

The fix that we implemented is working, and we see that IOPS are slowly returning to their normal state.
We expect it to take about 1 hour before everything is completely up and running. We apologize for the inconvenience.

Posted Oct 16, 2024 - 13:42 CEST

Identified

There are issues with severely degraded performance of Big Disk in Amsterdam. The cause has been identified and we are implementing a fix.

Posted Oct 16, 2024 - 12:11 CEST

This incident affected: Data Center (Amsterdam) and Services (NFS).