On Monday, September 14, a Microsoft Azure data center experienced an extended outage due to a cooling incident.
The UK South facility went offline when multiple chilled water pumps shut down for reasons unknown, forcing Microsoft to shut down the rest of the facility to stop temperatures from rising precipitously. The data center was unavailable from 13:54 UTC until 00:41 UTC the following day.
Everything went south
"A cooling loss event occurred when multiple chilled water pumps shut down," Microsoft Azure said on its status page.
"This resulted in cooling loss and the internal temperatures for some parts in a single data center began to rise above the operational thresholds in UK South. Automation began shutting down the network, compute, and storage resources to protect data durability."
Site engineers then placed the cooling system into manual mode and began to reset the affected pumps to recover the cooling plant. "This helped to bring temperatures to safe operational ranges in all the impacted areas of the data center by 16:40 UTC," Microsoft said.
"Once temperatures were within safe thresholds, engineers started to restore power to the affected infrastructure and began a phased approach to bringing this infrastructure back online. Once storage and the networking infrastructure was fully restored, dependent compute scale units began to recover. As compute scale units became healthy, virtual machines and other dependent Azure services recovered."
Among those impacted by the outage was the UK government's Covid-19 information portal, as first reported by The Register. "We are monitoring the situation closely and will update the website as soon as the services are restored," the government said at the time.
The outage came on the same day that Microsoft pulled its Project Natick data center up from the ocean floor. Relying on the outside environment for cooling, and filled with nitrogen gas, the underwater facility proved eight times more reliable than an on-land replica.