Google said the data center hosting one of its London cloud regions suffered “simultaneous failure of multiple, redundant cooling systems” during the UK’s recent record heatwave.
“On Tuesday, 19 July 2022 at 06:33 US/Pacific, a simultaneous failure of multiple, redundant cooling systems in one of the data centers that hosts the zone europe-west2-a impacted multiple Google Cloud services. This resulted in some customers experiencing service unavailability for impacted products,” Google said in a recent update to the incident report.
“To our customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps (detailed in the Remediation & Prevention section below) to improve the region's resilience.”
Google said that during the recent UK heatwave, one of the data centers that hosts the europe-west2-a zone could not maintain a safe operating temperature because of the cooling failure combined with the extreme temperatures outside, so the company shut down the facility to prevent further damage.
The company didn’t disclose the nature of the failure, but said its engineers are conducting an analysis of the system that triggered this incident and will be auditing cooling system equipment and standards across the data centers that house Google Cloud globally.
“We powered down this part of the zone to prevent an even longer outage or damage to machines. This caused a partial failure of capacity in that zone, leading to instance terminations, service degradation, and networking issues for a subset of customers.”
The company said a number of regional Google Cloud services were also affected during the incident because its team “inadvertently modified traffic routing” for internal services to avoid all three zones in the europe-west2 region, rather than just the impacted europe-west2-a zone.
Regional storage services, including GCS and BigQuery, replicate customer data across multiple zones. Because of the regional traffic routing change, they were unable to access any replica for a number of storage objects, which prevented customers from reading those objects while the routing error was in place.
As a result of the incident, Google said it would repair and “carefully re-test” its failover automation.
It also said it would investigate and develop “more advanced methods” to progressively decrease the thermal load within a single data center space, reducing the probability that a full shutdown is required.
Guy’s and St Thomas’ NHS Foundation Trust chief digital information officer Beverley Bryant explained in an internal video call seen by the BBC that the hospitals’ IT systems were knocked out by "ludicrous heat" leading to a failure in the data center's air-conditioning. She said: "The servers couldn't handle the heat and they collapsed in an unmanaged and uncoordinated way."