The outage of Microsoft’s cloud service for some customers in Western Europe on 26 July happened because the company’s engineers had expanded the capacity of one compute cluster but forgot to make all the necessary configuration adjustments in the network infrastructure.

“Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our data center network hardware devices,” Mike Neil, general manager for Windows Azure (Microsoft’s cloud-services brand), wrote in a blog post.

In response to an increase in demand for the West Europe sub-region of its cloud infrastructure, technicians added new capacity to a cluster serving customers in this sub-region. What they did not do was reconfigure the “safety valve mechanism” Neil described so it would accept more connections to match the new capacity.

As soon as the capacity was added, usage in the cluster quickly increased and exceeded the old connection threshold, which generated a large volume of network-management messages. All this “management traffic” triggered bugs in some of the hardware in the cluster.

The bugs drove the devices’ CPU (central processing unit) utilization to its maximum, disrupting data traffic.
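To make the failure mode concrete, here is a minimal sketch of how a connection-limit “safety valve” like the one Neil describes can flood a network with management messages when capacity grows but the limit is not raised. The class and field names (ConnectionLimiter, max_connections, alerts) and the numbers are illustrative assumptions, not Azure’s actual implementation.

```python
# Hypothetical sketch of a "safety valve" connection limit. Names and numbers
# are assumptions for illustration, not Microsoft's real configuration.

class ConnectionLimiter:
    def __init__(self, max_connections: int):
        self.max_connections = max_connections  # the "safety valve" threshold
        self.active = 0
        self.alerts = []  # stand-in for network-management messages

    def try_connect(self) -> bool:
        """Accept a connection only while below the configured threshold."""
        if self.active >= self.max_connections:
            # Each refused connection emits a management message; under
            # sustained demand this becomes the flood of "management traffic"
            # described in the post.
            self.alerts.append("connection limit exceeded")
            return False
        self.active += 1
        return True


# Capacity was added to the cluster, but the limit was left at its old value,
# so demand sized for the new capacity overruns the threshold.
limiter = ConnectionLimiter(max_connections=1_000)   # old limit, never raised
demand_after_expansion = 2_500                       # hypothetical new demand

accepted = sum(limiter.try_connect() for _ in range(demand_after_expansion))
print(f"accepted={accepted}, management messages={len(limiter.alerts)}")
```

In this toy run, every connection attempt beyond the stale limit produces another management message, which is the kind of traffic surge that, per Neil’s post, exposed the bugs in the cluster’s network hardware.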

Microsoft has now increased the limit settings in the affected cluster, as well as across all Azure data centers, Neil wrote. The company is also fixing bugs in the device software.

Finally, the cloud provider has improved its network-monitoring systems to catch similar connectivity issues before they affect running services.

Only Azure’s compute service hosted on the offending cluster in Europe was hit, Neil wrote; no other regions or services were affected by the 2.5-hour outage.

The data center infrastructure supporting Windows Azure consists of eight nodes in Europe, eight in the US and eight more across Asia-Pacific, Qatar and Brazil.