Network switches fail less often when they are running the open-source SONiC operating system, instead of proprietary software, according to a giant analysis of network data carried out by Microsoft.
Only two percent of network switches will fail within three months, but that figure is cut in half if the vendor's operating software is replaced by SONiC, an OS developed by Microsoft and the Open Compute Project, according to the study, which tracked 180,000 switches in Azure data centers for three months. The study also found that hardware causes more failures than software, and that one vendor's switches are twice as failure-prone as its rivals'. The study did not reveal which vendor had the less reliable switches, however.
SONiC switch upgrade
For three months, researchers at Microsoft gathered data every time a network switch in an Azure data center rebooted. The team, led by distinguished engineer and vice president David Maltz, took data from 180,000 switches in 130 locations, and then winnowed this data to separate failures from planned maintenance events before analyzing different causes and checking the failure rates of different hardware and software combinations.
The peer-reviewed paper concluded that there is a two percent chance of a switch failing in three months, and hardware was the most common problem, causing 32 percent of those reboots. Unplanned power loss caused 28 percent of failures, with software causing 17 percent of the problems. The failures were found by polling every switch on the network every six hours and asking about its most recent reboot.
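The polling approach can be illustrated with a minimal sketch. It assumes each poll returns the timestamp of the switch's most recent boot; the function name and data layout are hypothetical and are not taken from the paper.

```python
from datetime import datetime, timedelta

def detect_reboots(polls):
    """Return boot times that first appear after the initial poll.

    polls: list of (poll_time, last_boot_time) tuples, sorted by poll_time.
    A reboot is inferred whenever the reported last-boot time moves forward
    between two consecutive polls.
    """
    reboots = []
    previous_boot = None
    for _poll_time, last_boot in polls:
        if previous_boot is not None and last_boot > previous_boot:
            reboots.append(last_boot)
        previous_boot = last_boot
    return reboots

# Example: four polls six hours apart, with one reboot between polls 2 and 3.
start = datetime(2020, 1, 1)
old_boot = start - timedelta(days=10)
new_boot = start + timedelta(hours=7)
example_polls = [
    (start, old_boot),
    (start + timedelta(hours=6), old_boot),
    (start + timedelta(hours=12), new_boot),
    (start + timedelta(hours=18), new_boot),
]
print(detect_reboots(example_polls))
```

One consequence of this kind of scheme is that several reboots inside a single six-hour window would be seen as one, since only the latest boot time is reported. A reboot detected this way would still need to be classified as a failure or as planned maintenance.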
The paper acknowledges that it's not always easy to distinguish hardware faults from software faults: "Both hardware and software failures can manifest in the form of process crash logs and stack traces on the faulty switch, making it hard to differentiate between them in practice. Additionally, even when the failure logs clarify that the cause is a hardware fault, there can be ambiguity about the real source of the error. For example, parity errors encountered on switches can be caused by faulty hardware, bit flips etc., but they can be mitigated in the switch software. Therefore, it is unclear if these faults should be attributed to the hardware or the software."
The team took a pragmatic approach: "If the manufacturer provides a software fix for the failure, we categorize the fault as a software failure."
Software failures could be reduced by using the Microsoft/OCP SONiC network operating system, according to the study. Around three-quarters of the switches were running SONiC, and the study found these were one percent more likely to survive three months, which nearly halves the failure rate. "With time, the gap in reliability widens and at the end of three months, the survival likelihood of SONiC switches is 1 percent higher than that of non-SONiC switches."
SONiC's performance is down to a responsive open-source software model with in-house software developers: "We attribute the resilience of SONiC to the rapid develop-test-deploy cycle made possible by the in-house development of the software," says the paper. "Indeed, vendor software updates and patches are rolled out over longer timescales (e.g., several months). This leads to known issues re-occurring on devices that have not yet been patched with the vendor-supplied fixes. In contrast, SONiC failures are root-caused and fixed over short timescales due to in-house expertise and development teams."
As Timothy Prickett Morgan at The Next Platform commented: "This is truly the benefit of open source software. Your stuff is publicly laundered, which can be harsh, but it is also publicly cleaned, which can be much faster than a vendor fix because a bug is public in nature and there is enlightened self-interest at work, not just a desire to hide a problem and work on it secretly."
Who's the weakest link?
The Azure data set included switches from three major manufacturers, and the report found one of them was significantly more likely to fail than the other two. All the failure rates were low, and Microsoft anonymized the data so we can't know which vendor had the less-reliable switches.
According to the paper's terms, "Vendor-2" had twice the "hazard rate" of the most reliable vendor, Vendor-3. In practice, this means that after three months, three percent of Vendor-2 switches had failed, while only 1.5 percent of Vendor-3 switches had failed. The majority of Vendor-2 failures were unplanned power losses, while Vendor-3 switches tended to have hardware failures.
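The link between the two failure percentages and the "twice the hazard rate" claim can be checked with a short calculation. This is a sketch that assumes a constant hazard rate over the three months (an exponential survival model), which is a simplification and not necessarily the model used in the paper; the function name is illustrative.

```python
import math

def constant_hazard(failed_fraction, period_months):
    """Hazard rate per month, assuming survival S(t) = exp(-h * t).

    Solving 1 - failed_fraction = exp(-h * period_months) for h.
    """
    return -math.log(1.0 - failed_fraction) / period_months

# Figures quoted in the article: 3 percent vs 1.5 percent failed after three months.
vendor2_hazard = constant_hazard(0.03, 3)
vendor3_hazard = constant_hazard(0.015, 3)
hazard_ratio = vendor2_hazard / vendor3_hazard

print(f"Vendor-2 hazard per month: {vendor2_hazard:.5f}")
print(f"Vendor-3 hazard per month: {vendor3_hazard:.5f}")
print(f"Hazard ratio: {hazard_ratio:.2f}")
```

Under that assumption the ratio comes out at roughly two, consistent with the article's description. At failure rates this low, the hazard ratio and the simple ratio of failed fractions are nearly the same.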
We can't know for sure which vendors were in the study. Cisco, Arista, Huawei, HPE, and Juniper are among the leading switch vendors in the market as a whole. Hyperscale cloud vendors like Microsoft publicly support white label switches, but Mellanox and Arista Networks have both cited Microsoft as a customer in the past.
We do, however, know that the majority of the switches in the Azure data centers in this study were from one manufacturer, known as Vendor-1. "During our study period, nearly 75 percent of the aggregation switches in the cloud data centers were running SONiC on existing vendor-1 hardware."
The SONiC operating system runs on multiple hardware platforms, including Arista, Cisco, Nvidia/Mellanox, Dell, Juniper, and Nokia. Microsoft has a long-running partnership with Arista, which included developing the Switch Abstraction Interface (SAI) on which SONiC runs, so Arista has to be a good bet for one of the vendors involved.
Microsoft offered both SONiC and SAI to the Open Compute Project back in 2015.