Microsoft's slow outage recovery in Sydney due to insufficient staff on site

Only three people on site at time of power outage

Microsoft has revealed that recovery from its recent outage in Sydney, Australia, was slowed by a shortage of staff on site.

In its post-outage analysis, the cloud giant revealed that the on-site night team at the data center housing its Australia East region consisted of just three people.

Microsoft stated: "The staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven until the underlying issues are better understood and appropriate mitigations can be put in place."

The cloud giant explained that the outage was caused by the failure of chillers in the data hall. The hall holds seven chillers: five in operation and two on standby. When a power sag occurred during a thunderstorm, all five operating chillers went offline, and only one of the two standby chillers was successfully brought online.

In the analysis, Microsoft explained this failure: "The five chillers did not manage to restart because the corresponding pumps did not get the run signal from the chillers. This is important as it is integral to the successful restarting of the chiller units. We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start."

Because of the extended time without cooling, the hardware was turned off to protect it from heat damage. This also brought down the temperature of the chillers, which enabled them to start working again, and at that point Azure began bringing its compute and storage units back online. In total, approximately half of Azure's Australia East Cosmos DB clusters went down or were heavily degraded.

In addition to the shortage of staff, Azure identified that its automation systems had failed to work effectively. The company said it will look to improve this in the future: "The EOP for restarting chillers is slow to execute for an event with such a significant blast radius. We are exploring ways to improve existing automation to be more resilient to various voltage sag event types."

According to the analysis, seven Azure storage tenants were impacted: "Five standard storage tenants, and two premium storage tenants."
