An outage experienced by Microsoft Azure DevOps in Brazil was caused by a typo introduced during a code update.
On Wednesday, May 24, Microsoft Azure DevOps went offline from 12:10 to 22:31 UTC after an Azure SQL Server was accidentally deleted.
In a status update, Microsoft principal software engineering manager Eric Mattingly said: “During Sprint 222, we upgraded our code base to replace the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages.
"This resulted in a large pull request of mechanical changes swapping out API calls. Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database.”
DevOps engineers regularly take snapshots of databases to explore problems or test improvements, and they rely on a background job that runs daily and deletes old snapshots.
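Microsoft has not published the offending code, which lives in a .NET code base built on the Azure.ResourceManager packages, but the nature of the mistake is easy to illustrate: the database-level delete call and the server-level delete call look almost identical. The sketch below is a hypothetical Python rendering of such a cleanup step using the azure-mgmt-sql client; the function and resource names are invented for illustration.

```python
# Hypothetical sketch of a snapshot-cleanup step; names are invented for illustration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

def delete_snapshot_database(subscription_id: str, resource_group: str,
                             server_name: str, snapshot_db_name: str) -> None:
    client = SqlManagementClient(DefaultAzureCredential(), subscription_id)

    # Intended call: remove only the expired snapshot database.
    client.databases.begin_delete(resource_group, server_name, snapshot_db_name).result()

    # The kind of one-call mix-up the report describes would instead delete the
    # entire server, taking every database it hosts with it:
    # client.servers.begin_delete(resource_group, server_name).result()
```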
Sprint 222 initially ran internally with no issue but, when deployed to the customer environment, it had access to a snapshot database old enough to trigger the deletion bug.
Because the code deleted the server rather than the intended database, it removed all seventeen production databases for the scale unit, leaving it unable to process any customer requests. According to the company, no data was lost during the outage.
Although Azure was aware of the issue within 20 minutes, fixing the problem took upwards of 10 hours. According to Mattingly, this was due in part to the time it took to get an Azure engineer engaged and working on the problem, and in part to some databases having been created before Geo-zone-redundant backup was available, meaning they had to be copied to a paired region, which added several hours.
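Whether a database can be restored quickly depends on its backup storage redundancy setting. As a hedged illustration (property names as exposed by recent versions of the azure-mgmt-sql Python client, with placeholder resource names), the snippet below lists each database on a server alongside that setting, which is one way an operator could spot databases still on older, region-bound backups.

```python
# Hypothetical check of backup storage redundancy for every database on a server;
# subscription, resource group, and server names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")
for db in client.databases.list_by_server("<resource-group>", "<server-name>"):
    # Expected values include "Geo", "Zone", "GeoZone", and "Local".
    print(db.name, db.current_backup_storage_redundancy)
```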
The final cause of the delay, according to Mattingly, was a set of issues with the web servers in which the w3wp processes kept recycling and each time would perform a warm-up task that took around 90 minutes to complete, causing the web servers’ health probes to fail.
“Since this process was staggered amongst all web servers, by the time it finished only one or two servers would be back in the load balancer and taking customer traffic. They would then become overloaded, and the cycle would start again. Toward the end of the outage window, we blocked all traffic to the scale unit with our Resource Utilization feature to allow all web servers to warm up and successfully enter the load balancer. This resulted in users receiving rate limits and usage errors. Once all databases were healthy, we gradually unblocked users to ramp the customer traffic up to normal levels,” said Mattingly.
Microsoft has since deployed several changes to prevent this from occurring again.
In January of this year, Microsoft was hit by a widespread outage affecting Microsoft 365, Outlook, GitHub, Teams, and more. The outage lasted for five hours and was ultimately attributed to a Wide Area Network (WAN) router IP change.