An IP outage in a CenturyLink data center on Sunday brought Cloudflare servers down, affecting a number of websites hosted in the US and Western Europe.
Cloudflare CEO Matthew Prince shared the company’s own timeline of events in a blog post, saying that the ISP “experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet.”
At 10:03 UTC, Cloudflare technicians noted an increased number of 522 errors, indicating a problem connecting from Cloudflare’s network to the origin servers that host its customers’ sites.
The company’s automatic mitigation systems then came into play, rerouting traffic away from CenturyLink to alternative network providers including Cogent, NTT, GTT, Telia, and Tata Communications.
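As a rough illustration of that kind of automated response, the sketch below flags a spike in 522 responses and removes the degraded provider from the list of candidates. It is not Cloudflare’s actual system: the threshold, the function names, and the sample data are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical threshold: the share of 522 responses that triggers a reroute.
ERROR_RATE_THRESHOLD = 0.05


def error_rate_522(status_codes):
    """Return the fraction of responses in the sample that were HTTP 522."""
    if not status_codes:
        return 0.0
    return Counter(status_codes)[522] / len(status_codes)


def choose_providers(providers, degraded, status_codes):
    """Return the providers still eligible to carry traffic.

    If the 522 rate is at or below the threshold, every provider stays in
    use; otherwise the degraded provider is dropped from the list.
    """
    if error_rate_522(status_codes) <= ERROR_RATE_THRESHOLD:
        return list(providers)
    return [p for p in providers if p != degraded]


providers = ["CenturyLink", "Cogent", "NTT", "GTT", "Telia", "Tata Communications"]
sample = [200] * 90 + [522] * 10  # made-up sample: 10 percent of responses are 522s
print(choose_providers(providers, "CenturyLink", sample))
```

In practice rerouting happens at the routing layer rather than in application code, so this only captures the decision logic in miniature.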
But services weren’t restored for another four hours, Prince said, because “many hosting providers only have single-homed connectivity to the Internet” through CenturyLink - not to mention that many US end-users are contracted to the ISP.
The outage, Cloudflare said, caused a 3.5 percent drop in global traffic - mostly because CenturyLink customers were unable to access the Internet.
CenturyLink hasn’t issued a statement explaining exactly what happened, simply tweeting that it was “an IP outage,” but Cloudflare ventured that it may have been due to a spike in Border Gateway Protocol (BGP) updates.
According to the information obtained by Cloudflare, these updates emerged after the NOC team identified a bottleneck caused by a Flowspec rule - potentially issued by CenturyLink or, indeed, one of its customers - to mitigate an attack on the network. Flowspec (the BGP flow specification feature) distributes filtering and policing rules across BGP routers to mitigate the effects of distributed denial-of-service (DDoS) attacks.
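To give a sense of what such a rule does, here is a minimal sketch of a Flowspec-style filter: a rule matches traffic on fields such as destination prefix, protocol, and port, and attaches an action. The `FlowRule` class and `apply_rules` function are hypothetical names invented for this illustration, the addresses come from reserved documentation ranges, and the first-match evaluation is a simplification of how real routers order rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowRule:
    """A simplified, hypothetical stand-in for a Flowspec rule."""
    dest_prefix: str           # e.g. "203.0.113.0/24" (documentation range)
    protocol: Optional[int]    # IP protocol number (17 = UDP); None matches any
    dest_port: Optional[int]   # destination port; None matches any
    action: str                # "discard" or "accept" in this sketch

    def matches(self, dst_ip, protocol, dst_port):
        in_prefix = ipaddress.ip_address(dst_ip) in ipaddress.ip_network(self.dest_prefix)
        proto_ok = self.protocol is None or self.protocol == protocol
        port_ok = self.dest_port is None or self.dest_port == dst_port
        return in_prefix and proto_ok and port_ok


def apply_rules(rules, dst_ip, protocol, dst_port):
    """Return the action of the first matching rule, or "accept" if none match."""
    for rule in rules:
        if rule.matches(dst_ip, protocol, dst_port):
            return rule.action
    return "accept"


# Assumed scenario: drop UDP traffic to port 53 aimed at an attacked prefix.
rules = [FlowRule("203.0.113.0/24", protocol=17, dest_port=53, action="discard")]
print(apply_rules(rules, "203.0.113.10", 17, 53))
print(apply_rules(rules, "198.51.100.1", 17, 53))
```

Because rules like this are propagated to many routers at once via BGP, a rule that matches more traffic than intended can have network-wide effects.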
Striking a conciliatory tone, Prince explained why it may have taken the ISP so long to get its servers back online: “Finally, it never helps when these issues occur early on a Sunday morning. Networks the size and scale of CenturyLink/Level(3)’s are extremely complicated. Incidents happen. We appreciate their team keeping us informed with what was going on throughout the incident.”
However much it affected some company websites, the consequences of this weekend’s outage were likely not as significant as an incident which occurred in 2018, when a network issue brought down 911 voice calls in parts of the US, affecting not just emergency services but Verizon mobile data, ATM withdrawals, lottery drawings, and hospital patient records for almost 24 hours.
Cloudflare has had its fair share of mishaps, including one caused by a Flowspec malfunction in 2013, and another in July, when a router issue in the company’s Atlanta data center took out numerous services including Feedly, Tumblr, Discord, and more, impacting 12 of the company’s data center locations across the US.