The servers thought they had run out of flash storage
Google has revealed that the lengthy outage that hit US customers of its Compute Engine at the end of June was caused by a software glitch that manifested during SSD maintenance.
The glitch made the servers perceive all disks as full even when they were empty, causing elevated latency and errors for most writes that involved flash storage.
“We would like to apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future,” the company said on its status page.
You are running very low on disk space
Problems started on June 28 around 18:15 PDT, with some customers in the US seeing errors when they tried to write data to SSDs or create new disks. Compute instances using an SSD as their root disk also became unresponsive.
It turns out that two separate, pre-scheduled maintenance events that were supposed to rebalance the data across Google’s distributed storage system clashed with each other. Because of a software bug, they erased disk blocks but did not release them back into circulation. As a result, the software depleted the available SSD space until writes were rejected.
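For readers who want to see how this kind of failure plays out, here is a toy sketch in Python. It is not Google's code: the BlockPool class, the rebalance function and the eight-block pool size are all hypothetical, chosen only to show how erasing blocks without returning them to the free list can leave a pool that holds no data yet rejects every write.

```python
class BlockPool:
    """Toy model of an SSD block pool (hypothetical, for illustration only)."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))  # blocks available for new writes
        self.used = set()                     # blocks currently holding data

    def write(self):
        # A write needs a free block; with none left, it is rejected.
        if not self.free:
            raise OSError("no space left: write rejected")
        block = self.free.pop()
        self.used.add(block)
        return block

    def erase(self, block, release=True):
        # Erasing always drops the data. The assumed bug is skipping
        # the step that hands the block back to the free list.
        self.used.discard(block)
        if release:
            self.free.add(block)


def rebalance(pool, rounds, release):
    """Write and then erase one block per round; return rounds completed."""
    done = 0
    for _ in range(rounds):
        try:
            block = pool.write()
        except OSError:
            break  # the pool reports itself as full
        pool.erase(block, release=release)
        done += 1
    return done


healthy = rebalance(BlockPool(8), rounds=100, release=True)
buggy_pool = BlockPool(8)
buggy = rebalance(buggy_pool, rounds=100, release=False)
print(healthy, buggy, len(buggy_pool.used), len(buggy_pool.free))
```

In the buggy run the pool ends up with no blocks in use and none free, which is roughly the "full even when empty" condition described above, on a much smaller scale.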
According to Google, intermittent outages lasted about 211 minutes and were only fixed when engineers reverted one of the maintenance events that triggered the issue.
“To reduce downtime related to similar issues in future, Google engineers are refining automated monitoring such that, if this issue were to recur, engineers would be alerted before users saw impact,” the company said in a statement.
“We are also improving our automation to better coordinate different maintenance operations on the same zone to reduce the time it takes to revert such operations if necessary.”