Earlier today, network services provider Cloudflare suffered a significant outage, bringing down all of its services - and with them a large portion of the Internet, including Discord, Marketo, Down Detector and more.
"We built Cloudflare with a mission of helping build a better Internet and, this morning, we didn't live up to that," Cloudflare CEO Matthew Prince told DCD. "I take personal responsibility for that. And so I think that that it's disappointing, and it's painful."
In an interview with DCD as he rushed to the airport, Prince explained what went wrong, dispelled rumors about nation-state attacks, and discussed last week's unrelated BGP outage.
Here's what happened
"There has been speculation online that this was caused by some sort of external attack," Prince said. "We don't see any evidence that this was related to an external attack, although that was also our own team's initial speculation."
While some blogs and social media chatter pinned the outage on a DDoS attack by the Chinese government trying to bring Hong Kong protesters offline, Prince denied the claims. "We needed to make sure that nobody believes that was the case, because it wasn't the case. And while it would be incredibly convenient if that had been the case - because that would be an understandable issue - this was not that."
Instead, the problem, which affected users globally for up to 30 minutes, was caused internally by Cloudflare itself. In fact, it was Cloudflare's very own DDoS protection that was to blame.
"When we see an attack, our systems are designed to be able to scale up across all of those services to be able to mitigate it," Prince said. "Unfortunately, this morning it appears there was a bug in our firewall service that caused it to grow and scale over time, even though there was no attack that was targeting the service in any way."
Due to the bug, the Cloudflare Web Application Firewall (WAF) "was all of a sudden consuming significantly more CPU resources than would be normal." The system was designed to spread that load across Cloudflare's network, which spans more than 180 cities globally. But instead of distributing the load of a finite attack, it distributed the load of an ever-growing, resource-hogging bug that "for a while consumed our primary backup and our backup-backup system," pushing servers to 100 percent CPU load and causing unprecedented CPU exhaustion.
Prince said: "This was a unique problem, it was something that we had never seen before. I understand the system decision that we made that allowed it to become as broad of a problem as it as it did, because we again wanted to be able to design our systems in such a way that when they saw large attacks, they could scale up and use all the resources necessary in order to mitigate them.
"Unfortunately...it appears that we did not have the appropriate controls in place to make sure that a bug like this didn't cause a wider spread issue. But we will going forward."
As the company learns more about what exactly happened, Prince has promised full transparency, something he said was imperative to keeping customers after such incidents.
"We're fortunate that we haven't had a number of of significant issues," he said. "But I remember the first really significant issue we had was a hack back in 2012 that was incredibly painful. It affected one of our customers, and it impacted me personally, because the hacker had actually hacked into my personal email in order to get in.
"And I was embarrassed - frankly, I didn't want to share the details of everything that happened. Our team said 'No, that's not our culture, and that's not what we stand for and we really believe in this idea of being radically transparent with whatever happened.' I was afraid that we would lose customers. It turned out instead ...that transparency actually helped people build trust."
Equally important to retaining customers is honoring service level agreements (SLAs), something which Prince said "our team is already working to resolve. We've always taken a very broad view of whatever the SLAs are and honored them whenever there's something that is impacting the ability for customers to reach our network."
Prince also promised that the company had a "blameless culture", adding: "unless there was active malfeasance or something that was going wrong, I can't imagine this is something that someone is going to lose their job over."
What makes this incident particularly painful - for Prince, the company, and its customers - is that Cloudflare was similarly unavailable just last week.
Bad timing
"Today's issue was entirely our problem, this was a mistake which we made," he said. Last week's, however, was an external failing.
"22,000 networks had their network routes hijacked, [something] which impacted some fraction of Cloudflare's network," Prince said. "That is much more of an Internet-wide issue, and something that ...[the] entire Internet community needs to work in order to resolve.
"The two things were completely and totally unrelated," he said admitting that even within Cloudflare there was speculation about a connection between the two outages: "You see something happening and you think 'oh, it must be the same thing as what happened before.' These were were completely unrelated processes, completely unrelated teams."
Last week's issue was a problem with the Internet's Border Gateway Protocol (BGP), which manages how packets are routed across the Internet. Telecoms company Verizon accepted and propagated a huge set of accidentally leaked BGP routes, essentially sending Internet traffic to the wrong locations. Individual service providers can set up "more specific" routes to optimize their traffic; in this case, one of these more specific routes - leaked from another ISP through a metals manufacturer's corporate network - was inadvertently accepted and broadcast by Verizon, so it was used far more widely than it should have been, creating an unwanted bottleneck and effectively inviting overload.
Alex Henthorne-Iwane, of network monitoring company ThousandEyes, explained in a blog post for DCD: "The end result was that a huge set of user traffic headed toward Cloudflare and other providers got routed through a relatively small corporate network, with predictable results. Massive congestion from the traffic redirection led to high levels of packet loss and service disruption. Users simply weren’t able to reach the Cloudflare Edge servers and the apps and services that depended on them."
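The mechanics hinge on how routers choose between overlapping announcements: BGP uses longest-prefix matching, so a narrower, leaked route beats the legitimate, broader one and pulls traffic toward whoever announced it. The Python sketch below illustrates only that selection rule - the prefixes and labels are invented for the example and are not the routes involved in the incident.

```python
# Why a leaked "more specific" route attracts traffic: routers prefer the longest
# matching prefix, so a narrower announcement wins over the legitimate, broader one.
import ipaddress

# Example routing table: a broad, legitimate prefix and a narrower, leaked one.
routing_table = {
    ipaddress.ip_network("203.0.112.0/22"): "legitimate origin (e.g. a CDN edge)",
    ipaddress.ip_network("203.0.113.0/24"): "leaked route via a small corporate network",
}

def best_route(destination: str) -> str:
    """Pick the route whose prefix matches the destination most specifically."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in routing_table if dest in net]
    # Longest-prefix match: the most specific route (largest prefix length) wins.
    return routing_table[max(matches, key=lambda net: net.prefixlen)]

print(best_route("203.0.113.10"))  # -> "leaked route via a small corporate network"
```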
This is a problem much wider than Cloudflare, an issue with the fundamental design of the Internet itself. "The good news about today's issue is that it is entirely under our control, and therefore something that I know that we can fix, that we can put safeguards in place, and will not happen again," Prince said. "We make mistakes all the time - but we make different mistakes all the time, which I think is a sign of a healthy organization.
"The frustrating thing about last week's outage is that it is not entirely under our control, and so it requires us to work with other large networks like Verizon in order to clean up their systems. And that's something that - while we'll be able to fix [today's] issue very quickly - the issue about BGP route leaks is going to be something that takes us a lot longer to resolve, and it's going to take much more than just Cloudflare working on it to get it fixed."
BGP problems, and actual nation-state sponsored outages, will be discussed in an upcoming issue of DCD Magazine. Be sure to subscribe for free today.
[Some edits made to speech, for clarity]