Persistent lightning takes out some recent writes in persistent storage
Lightning strikes affecting a Google data center in Belgium have apparently caused some Persistent Disk storage customers to lose data.
A tiny fraction of data stored on Persistent Disks in Google’s europe-west1-b zone has been lost, after repeated lightning strikes hit power systems at a data center where some storage systems were susceptible to power interruptions. It is understood that this zone is served from Google’s St Ghislain data center in Belgium, and Google has said that permanent data loss occurred in less than one millionth of a percent (0.000001%) of the zone’s persistent disk space.
“This outage is wholly Google’s responsibility,” says Google’s incident report. “However, we would like to take this opportunity to highlight an important reminder for our customers: GCE [Google Compute Engine] instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters.”
Google is replacing the affected storage systems with more resilient ones, but suggests that customers who want higher availability should use “GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories” for their data.
On Thursday afternoon, four successive lightning strikes hit the electrical systems of a European data center and caused a brief loss of power to storage systems serving the europe-west1-b zone. Earlier DNS lookups indicated that this zone is physically located in Belgium, so the incident appears to have taken place at the St Ghislain data center, which Google built in 2009 and which is its first to operate without mechanical cooling.
Power was quickly restored, and the storage systems have battery backup, but Google says data was lost because “some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain”.
As a result, from Thursday to Monday, around five percent of the zone’s GCE instances received I/O errors from their Persistent Disks. Most of this data was recoverable because the storage systems had saved it to stable storage, but manual intervention was needed to restore it. By Monday, only a very few cases remained where recently written data was unrecoverable, leading to permanent data loss, Google reports.
Google says it is already upgrading its storage hardware, and that most of its Persistent Disk storage already runs on systems which are less susceptible to the power failure mode that triggered this incident. The upgrade work will continue.