Overheating at a Microsoft data center operating some of its cloud services, including Outlook, led to services being lost for close to 16 hours, from the afternoon of March 12 to the morning of March 13, US time.
Microsoft said the outage occurred as it went through a regular process of updating firmware on a core part of its physical plant.
“This failure resulted in a rapid and substantial temperature spike in the data center,” Microsoft blogger and corporate VP of test and service engineering Arthur de Hann said.
“This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the data center.”
He said the failure was “unexpected” and required both infrastructure software and human intervention to bring the core infrastructure, in a physical region of its data center, back online.
“Requiring this kind of human intervention is not the norm for our services and added significant time to the restoration,” de Hann said.
Microsoft is known for its distributed cloud computing environment, which operates from numerous modular data centers in geographically dispersed locations.
It also relies heavily on software for the automation of cloud loads and management of its data center environment.
Microsoft said a number of users of Hotmail, Outlook and SkyDrive, Microsoft’s file storage and sharing service, were affected. The event occurred as the company was rolling out an upgrade for Outlook.
The outage began at 1:35pm PDT on March 12, and services were back up at 5:43am PDT on March 13.
De Hann said teams worked throughout the evening to get Microsoft’s operations back up, and had a number of affected mailboxes fully restored by midnight and the rest early the following morning.
“Outages are something we take very seriously and invest a significant amount of our time and energy in doing our best to prevent,” de Hann said.