Update, July 2: Cloudflare appears to be suffering another outage, with users experiencing 502 errors. The cause of the outage is unknown, although Cloudflare said that it is "observing network performance issues." Our coverage of that outage can be found here.
For a deep dive into both outages, be sure to read our interview with Cloudflare CEO Matthew Prince here.
Original story: Customers of network services provider Cloudflare saw their websites slow or go offline due to a Verizon BGP (Border Gateway Protocol) error.
The intermittent outage lasted roughly one hour and 42 minutes, briefly taking down services including the popular chat app Discord, as well as Reddit, Twitch and others.
Bad, not Good, Problem
Cloudflare said in a statement: "Earlier today, a widespread BGP routing leak affected a number of Internet services and a portion of traffic to Cloudflare. All of Cloudflare’s systems continued to run normally, but traffic wasn’t getting to us for a portion of our domains. At this point, the network outage has been fixed and traffic levels are returning to normal.
"BGP acts as the backbone of the Internet, routing traffic through Internet transit providers and then to services like Cloudflare. There are more than 700k routes across the Internet. By nature, route leaks are localized and can be caused by error or through malicious intent. We’ve written extensively about BGP and how we’ve adopted RPKI to help further secure it."
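For readers unfamiliar with RPKI, the check it enables at a router can be sketched in a few lines of Python. This is a simplified illustration of route origin validation, not Cloudflare's or any vendor's implementation: the `Roa` class, the `validate` function, and the prefixes and AS numbers (taken from ranges reserved for documentation) are all hypothetical. A route counts as valid only if a published authorization covers it, names the same origin AS, and permits a prefix that specific.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """A simplified Route Origin Authorization (hypothetical structure)."""
    prefix: ipaddress.IPv4Network
    max_length: int   # longest prefix length the holder allows to be announced
    origin_asn: int   # AS number authorized to originate the prefix


def validate(route_prefix, origin_asn, roas):
    """Classify a BGP announcement as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        # A ROA covers the route if the route sits inside the ROA's prefix.
        if route.version == roa.prefix.version and route.subnet_of(roa.prefix):
            covered = True
            if route.prefixlen <= roa.max_length and origin_asn == roa.origin_asn:
                return "valid"
    # Covered by at least one ROA but matching none of them: reject.
    return "invalid" if covered else "not-found"


# Documentation-only prefix and AS numbers, chosen for illustration.
ROAS = [Roa(ipaddress.ip_network("198.51.100.0/24"), 24, 64496)]

print(validate("198.51.100.0/24", 64496, ROAS))  # authorized announcement
print(validate("198.51.100.0/25", 64496, ROAS))  # more specific than max_length
print(validate("198.51.100.0/24", 64511, ROAS))  # wrong origin AS
print(validate("203.0.113.0/24", 64496, ROAS))   # no ROA covers this prefix
```

In this sketch, the second announcement is rejected purely because it is more specific than the authorization allows, even though the origin AS is correct.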
Cloud provider Amazon Web Services also suffered some issues, but to a lesser extent. The company said on its status page: "Between 3:34 AM and 6:01 AM PDT we observed an issue with an external provider outside of our network, which impacted Internet connectivity between some customer networks and multiple AWS Regions. Connectivity to instances and services from other providers and within the Region was not impacted by the event. The issue has been resolved and connectivity has been restored."
BGP route leaks are not without precedent, with a similar incident earlier this month taking out WhatsApp, while another in November 2018 caused significant outages across Google’s portfolio of services.
Network monitoring company ThousandEyes told DCD in a statement:
“Starting at approximately 7am ET, a major Internet disruption occurred in what appears to have been a significant BGP route leak event affecting a variety of prefixes from multiple providers, including Cloudflare, AWS and several others. Sites served through the Cloudflare CDN were impacted for nearly two hours. The incident does not appear to be malicious. DQE, a transit provider, appears to have been the original source of the route leak, which was propagated through Allegheny Technologies, a customer of both DQE and Verizon. Unfortunately, Verizon further propagated the route leak, magnifying the impact.
"BGP route leaks are not uncommon on the Internet. This incident is yet another example of how incredibly easy it is to dramatically alter the service delivery landscape in the Internet. The deeply interconnected nature of the Internet means that a glitch in one part of the infrastructure can very easily have cascading effects on another.
"In this case, our monitors detected that the leaked routes were advertised to the Internet from the Allegheny Technologies network. However, it does not appear likely that Allegheny was the original source of the leaked routes. The route leak event has two distinctive characteristics: large scale of leaked routes and the presence of more specific prefixes pertaining to reaching another organization’s network--in this case, Cloudflare.
"Allegheny is not a provider, but rather a metals manufacturer that peers with DQE and Verizon. The sheer scale of routes involved is more consistent with route leaks coming from transit providers.
"Outside of route hijacks, the presence of more specific prefixes for a third party network is unusual to see. Typically in route hijacking, the more specific routes are announced as the origin in order to authoritatively draw all traffic to the advertising network. In this case, we didn’t see the more specific routes announced as origin, which would appear to rule out an intentional route hijack.
"There are very few reasons to generate more specific prefixes for a third party network. One reason to do so is when a transit provider wants to optimize the cost and performance of delivering traffic from various Internet sites to its downstream customers, which can be accomplished using BGP optimization software.
"Based on these evidence points, it seems more likely that DQE as a transit provider was the original source of the route leak, which included a set of more specific prefixes for Cloudflare that may have been used for route optimization.
"It appears that Allegheny readvertised the leaked routes to Verizon. Unfortunately, Verizon didn’t have filtering mechanisms in place to stop this large-scale route leak at its peering with Allegheny, and propagated the leaked routes further.
"This event reminds us of what we saw back in November when a BGP route leak was traced to MainOne, a small ISP in Nigeria, that ultimately resulted in some Google traffic being re-routed on a global basis.
"This incident also points out that Internet routing is still incredibly vulnerable. Route leaks from smaller networks are often propagated by large providers, even though there are common filtering techniques available to reduce the impact of these events.
"Ultimately, in a cloud-centric world, enterprises must have visibility into the Internet if they’re going to be successful in delivering services to their users. Most enterprise IT teams are still not aware of how different the Internet is as an infrastructure as compared to carrier and enterprise networks, and are unprepared for such an unpredictable environment. If you can’t see what’s happening, you can’t hold providers accountable and solve problems.”
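Two mechanisms sit behind the ThousandEyes account: routers prefer the most specific matching prefix, which is why leaked more-specific routes pull traffic away from the legitimate path, and a provider can filter what it accepts from a customer. The Python sketch below illustrates both under stated assumptions. The function names, route labels and prefixes (from documentation ranges) are hypothetical, and the filter is a deliberately crude stand-in for the prefix lists and maximum-prefix limits that real routers offer, not a description of any network's actual configuration.

```python
import ipaddress


def best_route(destination, routes):
    """Pick the route whose prefix is the longest match for the destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, label in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, label))
    if not matches:
        return None
    # Longest prefix wins, regardless of which route was learned first.
    net, label = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), label


def accept_from_customer(announced, allowed, max_prefixes):
    """Hypothetical inbound filter on a customer BGP session."""
    # Maximum-prefix limit: a customer sending far more routes than expected
    # is treated as a leak and nothing is accepted (models a session shutdown).
    if len(announced) > max_prefixes:
        return []
    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    accepted = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Assumption: anything inside the customer's registered space is fine.
        if any(net.version == a.version and net.subnet_of(a) for a in allowed_nets):
            accepted.append(prefix)
    return accepted


# One legitimate /24 and two leaked /25s that together cover the same space.
ROUTES = {
    "192.0.2.0/24": "legitimate path",
    "192.0.2.0/25": "leaked path",
    "192.0.2.128/25": "leaked path",
}
# What the customer sends versus what it is registered to announce.
ANNOUNCED = ["203.0.113.0/24", "192.0.2.0/25", "192.0.2.128/25"]
ALLOWED = ["203.0.113.0/24"]

print(best_route("192.0.2.10", ROUTES))
print(accept_from_customer(ANNOUNCED, ALLOWED, max_prefixes=10))
print(accept_from_customer(ANNOUNCED, ALLOWED, max_prefixes=2))
```

Without a filter, traffic for 192.0.2.10 follows the leaked /25. With the prefix list in place only the customer's own /24 is accepted, and with a tight maximum-prefix limit the whole batch is refused.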