A March outage experienced by Microsoft's Azure cloud in South Africa was caused by a series of cuts to subsea fiber cables.
In a recent Azure Incident Retrospective, Microsoft executives discussed the outage, which lasted from 10:33 UTC on March 14 until 11:00 UTC on March 15 and affected the company's South Africa North and South Africa West regions.
The outage was caused by multiple subsea fiber cuts along the East and West coasts of Africa.
According to Dave Maltz, technical fellow and corporate vice president of Azure's networking team, Microsoft's Azure region in the country has a four-way redundancy plan. In other words, all the Internet traffic has four different paths that it could take in and out of the South Africa region.
"These are physically diverse, so as issues happen to one, we have enough resiliency and capacity to carry the traffic on others," explains Maltz.
The redundancy plan means that even if three of the four paths were affected, the region would still be able to run as normal. However, in this case, 3.5 of the paths were impacted and Microsoft no longer had the capacity available.
The first subsea cable issue happened in the Red Sea. Current theories suggest that a cargo ship, heavily damaged in an attack by Houthi rebels, was drifting and dragging its anchor across the seabed, accidentally damaging cables.
According to Maltz, Microsoft is "continuously running simulations of what would happen if this set of stuff broke.
"Those simulations are ongoing and looking not just with the current traffic load, but also predicted peaks. Any time we detect a situation where an expected series of failures would result in us not having enough capacity, that triggers what we call our urgent augmentation process where we create a list of pathways in the network where we need additional capacity.
"We then partner across Microsoft - the networking team along with the finance teams and the folks who actually go and acquire fiber or the rights of way necessary to build it, and we start construction of new paths that will give us enough capacity to survive those future fiber failures."
According to Maltz, after the East Coast outage, the simulations showed that additional failures could leave Microsoft without enough capacity. Because of this, the urgent augmentation process had already begun by the time of the second outage. "This led to us being able to deliver new capacity on an additional cable system far earlier than we would otherwise."
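The kind of check Maltz describes can be sketched in a few lines of Python. This is a minimal illustration, not Microsoft's tooling: the path names, capacity figures, and the function `find_risky_failures` are all hypothetical, and real simulations would model topology and traffic in far more detail. The sketch enumerates combinations of path failures and flags any that would leave less capacity than the predicted peak.

```python
from itertools import combinations


def find_risky_failures(path_capacity, peak_demand, max_failures):
    """Return every set of failed paths that leaves less capacity than peak_demand."""
    risky = []
    paths = sorted(path_capacity)
    for k in range(1, max_failures + 1):
        for failed in combinations(paths, k):
            # Capacity left on the paths that are still up.
            remaining = sum(
                cap for name, cap in path_capacity.items() if name not in failed
            )
            if remaining < peak_demand:
                risky.append(failed)
    return risky


# Hypothetical numbers: four paths, each sized to carry the whole predicted peak.
all_paths = {"path-a": 100, "path-b": 100, "path-c": 100, "path-d": 100}
healthy = find_risky_failures(all_paths, peak_demand=100, max_failures=3)

# After a first incident: two paths are gone and one is running at half capacity.
degraded_paths = {"path-c": 100, "path-d": 50}
at_risk = find_risky_failures(degraded_paths, peak_demand=100, max_failures=1)

print(healthy)  # no combination of up to three failures is a problem
print(at_risk)  # losing path-c alone would now leave too little capacity
```

In this toy setup, a non-empty result is the condition that would trigger an augmentation process like the one described above.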
The next fiber cut, this time along the West Coast of Africa, was caused by seismic activity, leaving Microsoft without sufficient capacity to keep running as normal.
"The Azure networking team then responded to this by finding ways to reduce the amount of traffic that was flowing across these fiber links. Basically all the discretionary traffic and all of our internal replication traffic that isn't time sensitive, we shifted that away or delayed it completely," said Maltz.
While fiber cuts are quite common, subsea cable cuts are different in terms of time to repair. Frank Rey, partner and head of hyperscale network connectivity, said: "They are thousands of kilometers away from any kind of port, and there's a small number of ships that can go and repair these."
In this case, the cable cuts occurred off the West Coast near Ghana, and repair ships were sailing from Cape Town.
"A terrestrial cable's mean time to repair is anywhere between four to eight hours, whereas a subsea cable is measured in weeks." Because of this, you cannot simply wait for the repair to be conducted, other solutions must be sought to find capacity.
Over the weekend following the second incident, Microsoft managed to triple the capacity it had remaining and "put a new diverse path on the West Coast to service South Africa."
Rey added: "We are also in the process of expediting a fifth path in the region that would go from Johannesburg up into our region in the UAE to add additional resiliency to South Africa.
"On top of that, we have some investments that have been in flight for some time now. We are investing over $100 million to serve our South Africa regions for our WAN capacity and Microsoft Capacity. So some of the new cables that are being built to serve South Africa are going to have our own equipment on them which will add even more headroom."
Another solution used by Microsoft was to stop using Lagos as an Edge site. Essentially, the company took all the capacity from that location and redirected it to South Africa. "The impact of this was that customers in Lagos that were looking to reach Microsoft services would see their traffic route via a third party ISP to another place where we interconnect with that ISP and then entered our network."
Once all the cables are repaired, Lagos will go back online in the Microsoft Network.
Finally, Microsoft uses a bandwidth broker, which essentially grants services a lease on the capacity they need to send traffic across the network. During an outage, the broker can shift which services get capacity and prioritize those that are more important.
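The idea of priority-based leasing can be illustrated with a short Python sketch. It is an assumption-laden toy, not a description of Microsoft's broker: the `Request` class, the `grant_leases` function, the service names, and the rule that a lower priority number means more important are all invented for illustration. Leases are granted in priority order until the available capacity runs out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    service: str
    priority: int  # assumption: lower number means more important
    demand: int    # requested bandwidth, in arbitrary units


def grant_leases(requests, capacity):
    """Grant bandwidth in priority order until capacity is exhausted."""
    leases = {}
    remaining = capacity
    for req in sorted(requests, key=lambda r: r.priority):
        granted = min(req.demand, remaining)
        leases[req.service] = granted
        remaining -= granted
    return leases


# Hypothetical services and demands.
requests = [
    Request("customer-traffic", priority=0, demand=60),
    Request("internal-replication", priority=2, demand=50),
    Request("telemetry", priority=1, demand=20),
]

normal = grant_leases(requests, capacity=150)  # enough for everyone
outage = grant_leases(requests, capacity=70)   # capacity cut during an outage

print(normal)
print(outage)
```

With full capacity every service gets what it asks for; with reduced capacity the lowest-priority traffic, here the internal replication, is squeezed out first, which mirrors the deferral of non-time-sensitive traffic described earlier.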
In February of this year, Microsoft announced plans for a new data center campus in Centurion, South Africa.
The first US cloud provider to enter the country, Microsoft opened two Azure regions in South Africa in 2019, in Johannesburg and Cape Town.
The Cape Town region, however, was de-listed at the start of 2021 and may have been re-classified as a ‘reserved access region.’