Microsoft has revealed that recovery from its recent outage in Sydney, Australia, was hampered by a shortage of staff on site.
In its post-outage analysis, the cloud giant said the night team on site at the data center housing its Australia East region consisted of just three people.
Microsoft stated: "The staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven until the underlying issues are better understood and appropriate mitigations can be put in place."
The cloud giant attributed the outage to the failure of chillers in the data hall. The hall holds seven chillers: five in operation and two on standby. During the power sag, which occurred amid a thunderstorm, all five operating chillers went offline, and only one of the two standby chillers was successfully brought online.
In the analysis, Microsoft explained this failure: "The five chillers did not manage to restart because the corresponding pumps did not get the run signal from the chillers. This is important as it is integral to the successful restarting of the chiller units. We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start."
Because of the extended time without cooling, the hardware was turned off to protect it from heat damage. The shutdown also brought down the temperature of the chillers, which enabled them to start working again, at which point Azure began bringing its compute and storage units back online. In total, approximately half of Azure's Australia East Cosmos DB clusters went down or were heavily degraded.
In addition to the staff shortage, Azure found that its automation systems had failed to work effectively. The company said it will look to improve them: "The EOP for restarting chillers is slow to execute for an event with such a significant blast radius. We are exploring ways to improve existing automation to be more resilient to various voltage sag event types."
According to the analysis, seven tenants, including Azure, were impacted: "Five standard storage tenants, and two premium storage tenants."