Microsoft has released the preliminary analysis of the significant disruption to its Azure cloud services that began on September 4 and lasted several days.
The outage, triggered by lightning strikes near the company's South Central US data centers in Texas, affected users around the world and disrupted other Microsoft services such as Skype and Office 365.
The cloud is just somebody else's data center
"In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region. Multiple Azure data centers in the region saw voltage sags and swells across the utility feeds. At 08:42 UTC, lightning caused electrical activity on the utility supply, which caused significant voltage swells," the preliminary report states.
The voltage swells caused one Azure data center to switch from utility power to generators, and its mechanical cooling systems to shut down, "despite having surge suppressors in place."
"Initially, the data center was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the data center temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated."
While this shutdown was meant to protect the infrastructure and servers, "temperatures increased so quickly in parts of the data center that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units."
Onsite teams then switched the rest of the data center to generators to stabilize the power supply and began working on recovering the Azure Software Load Balancers (SLBs) for storage scale units, which manage the routing of both customer and platform service traffic.
The next step was to recover the storage servers and the data held on them, by replacing failed infrastructure components, migrating customer data from the damaged servers to healthy servers, and validating that none of the recovered data was corrupted. "This process took time due to the number of servers damaged, and the need to work carefully to maintain customer data integrity above all else," the company said.
During this process, Microsoft decided not to fail over to another data center "since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication."
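To illustrate why asynchronous replication makes a fail over lossy, here is a minimal Python sketch. It is a toy model, not Microsoft's implementation: the `Region` and `AsyncGeoReplicator` classes, the `blob-N` keys, and the batch size are all hypothetical. The point it shows is that the primary accepts writes before they have been copied to the secondary, so anything still queued at the moment of fail over never reaches the second site.

```python
from collections import deque


class Region:
    """Hypothetical stand-in for a data center holding key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class AsyncGeoReplicator:
    """Toy model: the primary accepts writes at once, the secondary catches up later."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # writes accepted by the primary but not yet shipped

    def write(self, key, value):
        # The write is stored on the primary without waiting for the secondary.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self, max_items):
        # Background step: ship up to max_items queued writes to the secondary.
        shipped = 0
        while self.pending and shipped < max_items:
            key, value = self.pending.popleft()
            self.secondary.data[key] = value
            shipped += 1
        return shipped

    def fail_over(self):
        # Promoting the secondary abandons whatever is still queued.
        lost = [key for key, _ in self.pending]
        self.pending.clear()
        return lost


primary = Region("primary")
secondary = Region("secondary")
replicator = AsyncGeoReplicator(primary, secondary)

for i in range(5):
    replicator.write(f"blob-{i}", f"payload-{i}")

replicator.replicate(max_items=3)   # only three writes reach the secondary in time
lost_keys = replicator.fail_over()  # the primary is abandoned at this point
print(lost_keys)                    # ['blob-3', 'blob-4']
```

In this model the two most recent writes exist only on the abandoned primary. Recovering the primary in place, as Microsoft chose to do, avoids that loss at the cost of a longer outage.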
While the lightning strikes only affected Texas, "this particular set of issues also caused a cascading impact to services outside of the region."
When it rains, it pours
Beyond the services hosted in South Central US, Microsoft detailed six areas that were affected by the outage:
Azure Service Manager (ASM), which had "insufficient resiliency"; Azure Active Directory (AAD), which was meant to reroute to other sites but struggled under the increased rate of authentication requests; Visual Studio Team Services (VSTS), which had some capabilities hosted only in South Central US; Azure Application Insights, which depended on Azure Active Directory; Azure subscription management, which "experienced five separate issues"; and the Azure status page, which, as we noted at the time, was unavailable for the majority of the outage. The status page failure, Microsoft said, was due to "the combination of the increased traffic and incorrect auto-scale configuration settings."