On June 8, Fastly, a global content delivery network (CDN) provider, suffered an outage that impacted the sites and applications of many of its customers. The outage lasted about an hour, and Fastly later issued a statement identifying a latent bug, introduced in a software update, as the culprit.
The blackout for some of the world’s most visited sites served as a stark reminder that the Internet can fail. The Internet remains an elusive black box to even the most seasoned network professionals: it is a complex ecosystem of providers and interdependencies, and when something goes wrong with one of them, the impact on web users can be global.
Yet, Fastly’s outage didn’t affect all of its customers in the same way. What became clear is that customers’ site operations teams with a redundancy plan informed by an understanding of the Internet’s underlying protocols and services were able to act quickly and reduce the impact on their own customers.
Here, we unpack how the outage unfolded for different organizations and why knowledge of the Internet-dependent services—internal and external—is essential to avoiding the dreaded downtime list.
Understanding CDNs like your ABCs
To build, operate, and troubleshoot applications and services reliably over the Internet, you first need to understand the various building blocks that make up the web, including content delivery network (CDN) providers.
Many services are composed of dozens or hundreds of different web objects which leverage different CDN providers to deliver content to users, primarily for redundancy but also for optimizing performance. For example, user requests could be load balanced across multiple CDNs using DNS query responses. Alternatively, the root object for a site could point to an index.html file served by one particular CDN provider, but subsequent site components could be served by different CDN(s) or other sources.
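To make the DNS-based approach concrete, here is a minimal sketch of weighted selection across several CDNs. The hostnames, the weights, and the function names `pick_cdn` and `simulate` are assumptions made for illustration; a real deployment would typically rely on a managed DNS provider's traffic-steering features rather than hand-written code.

```python
import random

# Hypothetical CDN hostnames and traffic weights (assumptions for illustration).
CDN_WEIGHTS = {
    "cdn-a.example.net": 50,
    "cdn-b.example.net": 30,
    "cdn-c.example.net": 20,
}


def pick_cdn(weights, rng=random):
    """Return one CDN hostname, chosen with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, queries, seed=0):
    """Count how many of `queries` simulated DNS answers point at each CDN."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    counts = {name: 0 for name in weights}
    for _ in range(queries):
        counts[pick_cdn(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    print(simulate(CDN_WEIGHTS, 10_000))
```

Over many queries, each CDN receives a share of traffic roughly in line with its weight, so no single provider carries every request.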
Ultimately, how a site or application owner chooses to architect its content delivery can determine the severity of impact of an outage like the one Fastly experienced.
An outage of many outcomes
Although the outage took about an hour to resolve, it produced a number of different outcomes for users. Some of Fastly’s customers had resilient delivery architectures or were able to act quickly, minimizing the impact on their services by using alternative providers to deliver content. While there was a dramatic, global drop in the availability of Fastly’s service, not all of the content it delivered went offline, and not all customers were equally affected.
Some customers were using Fastly’s service as the sole CDN for their primary site domains. However, that doesn’t mean they were impacted in the same way.
Upon review, one customer using Fastly as its sole CDN showed evidence of remediation by its IT team. A close look at the network path shows that, about 40 minutes after the site went offline, site operators made a manual update that re-routed traffic away from Fastly servers to GCP servers, lessening the impact of the outage.
Another customer used not one but three CDNs to deliver its site, continuously load balancing traffic across them to deliver the best possible experience to visitors. When the Fastly outage hit, the customer began removing Fastly from its DNS responses to move traffic away from the impacted servers.
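Removing a failing provider from DNS responses amounts to dropping it from the pool and spreading its share across the remaining CDNs. The sketch below shows that idea under stated assumptions: the provider names, the weights, and the `healthy_pool` helper are hypothetical, and it does not represent how any particular customer implemented its failover.

```python
def healthy_pool(weights, health):
    """Keep only CDNs marked healthy, then rescale their weights to sum to 100."""
    alive = {name: w for name, w in weights.items() if health.get(name, False)}
    if not alive:
        # Nothing left to answer with; the caller must fall back (e.g. to origin).
        raise RuntimeError("no healthy CDN left in the pool")
    total = sum(alive.values())
    return {name: 100 * w / total for name, w in alive.items()}


if __name__ == "__main__":
    # Hypothetical three-CDN setup in which provider "a" is failing health checks.
    weights = {"a": 50, "b": 30, "c": 20}
    health = {"a": False, "b": True, "c": True}
    print(healthy_pool(weights, health))  # {'b': 60.0, 'c': 40.0}
```

Note that clients keep using cached DNS answers until the record's TTL expires, so a change like this takes effect gradually rather than instantly.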
A reminder on redundancy
While the Fastly outage was far-reaching, not every site using its services experienced severe effects. What became apparent was that customers using multiple CDNs were only partially impacted and were eventually able to fall back on alternative providers. Businesses using Fastly as their sole CDN were taken offline completely, and although we know some customers were able to redirect users to their origin servers, the manual process further delayed getting their users back online. So, what can we learn from the Fastly outage?
Firstly, it’s critical that enterprises diversify their delivery services. Just as redundant DNS is best practice, two or more CDNs should be considered to ensure optimal delivery and to reduce the impact of any one CDN experiencing a disruption in service.
Secondly, a business needs to understand all of its dependencies, even indirect, “hidden” ones. For example, the external sites that serve site and app components may have dependencies of their own, such as DNS or hosting. The business needs a full understanding of these so it can ensure they are resilient too.
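A first step toward mapping dependencies is simply listing the external hosts a page pulls content from. The sketch below does this for a single HTML document using only the Python standard library. The sample markup, the hostnames, and the `third_party_hosts` helper are assumptions for illustration, and the approach only finds hosts referenced directly in `src` and `href` attributes, not deeper dependencies such as each host's DNS or hosting provider.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HostCollector(HTMLParser):
    """Collect the hostnames found in src and href attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname  # None for relative URLs
                if host:
                    self.hosts.add(host)


def third_party_hosts(html, own_host):
    """Return the sorted hostnames referenced by `html`, excluding `own_host`."""
    collector = HostCollector()
    collector.feed(html)
    return sorted(h for h in collector.hosts if h != own_host)


# Hypothetical page markup used only to exercise the helper.
SAMPLE = """
<html><head>
<script src="https://static.cdn-a.example.net/app.js"></script>
<link rel="stylesheet" href="/local.css">
</head><body>
<img src="https://images.cdn-b.example.net/logo.png">
<a href="https://www.example.com/about">About</a>
</body></html>
"""

if __name__ == "__main__":
    print(third_party_hosts(SAMPLE, "www.example.com"))
```

Each host in the resulting list is a candidate for further investigation: who serves it, which DNS provider it relies on, and whether it has a fallback.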
Having a backup plan for when outages (inevitably) happen is also part and parcel of operating within a modern, digital supply chain. Organizations need to be sure they have visibility into early warning signs of any issues, so they can activate their backup procedures. Better still, continuously evaluating the availability and performance of site delivery will empower businesses to proactively detect problems and respond quickly before they impact service.
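As one way to picture an early warning signal, the sketch below tracks the success rate of recent availability probes and flags when it drops below a threshold. The `AvailabilityMonitor` class, the window size, and the threshold are all assumptions for illustration; the probe results would come from whatever monitoring the organization already runs.

```python
from collections import deque


class AvailabilityMonitor:
    """Track recent probe results and flag when availability drops too low."""

    def __init__(self, window=10, threshold=0.8):
        self.results = deque(maxlen=window)  # keeps only the newest `window` probes
        self.threshold = threshold

    def record(self, ok):
        """Store the outcome of one probe (True for success, False for failure)."""
        self.results.append(bool(ok))

    def availability(self):
        """Fraction of successful probes in the window (1.0 if no data yet)."""
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_fail_over(self):
        """True once the window is full and availability is below the threshold."""
        full = len(self.results) == self.results.maxlen
        return full and self.availability() < self.threshold


if __name__ == "__main__":
    monitor = AvailabilityMonitor(window=5, threshold=0.8)
    for ok in [True, True, True, True, True, False, False]:
        monitor.record(ok)
    print(monitor.availability(), monitor.should_fail_over())
```

Waiting for a full window before signaling trades a little detection speed for fewer false alarms from a single failed probe.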
In many ways, the Internet has become the new enterprise backbone, especially in the last year. The Fastly outage is a fitting example of the current state of the hyper-connected web and the role that key providers play in the digital delivery of web content. But it’s also a firm reminder of the need to understand architectural differences and, ultimately, of the importance of redundancy for every critical site dependency.