Resolved Maintenance
Maintenance: Ceph Storage
Start: 2026-03-12 07:00 CET
End: 2026-03-12 15:00 CET
Duration: 8 h 0 min
Over the past week we have seen recurring I/O‑blocking incidents in our Ceph cluster. Our investigations point to overloaded I/O queues on three RAID controllers of the same type. When large numbers of I/O requests are released from the buffers simultaneously, the controller's queue can deadlock, which leads to the blocked requests you have experienced.
Counter‑measures:
We have already reduced the load on the affected controllers. In addition, we will tune buffer draining so that the buffers empty more continuously rather than in large bursts, which will smooth the flow of I/O toward the controllers. Additional threads will be assigned to distribute the load more evenly across the RAID controller queues. Finally, we will limit the queue depth of the SSD devices to avoid overloads.
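To illustrate why more continuous draining should help, here is a minimal toy model. It is only a sketch under simplifying assumptions and does not reflect our actual Ceph or controller configuration: the function names, the tick-based timing, and all numbers are hypothetical. It compares the peak depth of a controller queue when a buffer is flushed in one burst with the peak depth when the same buffer is drained a little on every tick.

```python
def peak_controller_queue(arrivals, service_per_tick, release):
    """Return the highest controller queue depth seen during the simulation.

    arrivals         -- requests entering the buffer on each tick
    service_per_tick -- requests the controller completes per tick
    release          -- policy: given the buffered count, how many to release now
    """
    buffered = 0  # requests waiting in the buffer
    queued = 0    # requests waiting in the controller queue
    peak = 0
    for incoming in arrivals:
        buffered += incoming
        released = release(buffered)
        buffered -= released
        queued += released
        peak = max(peak, queued)
        # The controller works off part of its queue before the next tick.
        queued = max(0, queued - service_per_tick)
    return peak


def burst_release(threshold):
    """Hold everything back, then flush the whole buffer once it is full."""
    return lambda buffered: buffered if buffered >= threshold else 0


def continuous_release(max_per_tick):
    """Release a bounded number of requests on every tick."""
    return lambda buffered: min(buffered, max_per_tick)


if __name__ == "__main__":
    arrivals = [10] * 12  # hypothetical steady load: 10 requests per tick
    print(peak_controller_queue(arrivals, 12, burst_release(60)))       # 60
    print(peak_controller_queue(arrivals, 12, continuous_release(12)))  # 10
```

In this model the total amount of work is identical in both cases. Only the release pattern differs, and the bounded per-tick release keeps the controller queue far below the depth reached by the burst flush.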
The maintenance will take place during regular business hours because the issue only manifests under I/O patterns that occur during daytime operations. We have not observed any blocked I/O requests outside of those hours. Working during business hours therefore gives us the best opportunity to detect any side effects early and to mitigate them before they impact your workloads.
Please note: We cannot guarantee that these measures will completely eliminate the deadlock condition. We will, however, monitor the outcome closely and be prepared to take further steps if necessary.