One of the most prevalent themes over the past year has been resilience. As a global pandemic destabilized the underlying assumptions on which many businesses were built, it underscored the importance of being resilient in the face of disasters and drastic change. This logic should also apply to IT infrastructures.

Organizations need to accept that infrastructure disasters aren’t just a possibility; they are an inevitability, whether an outage, breach, fire, storm, flood, or, in at least one case, mice. However, a disaster in a data center doesn’t have to damage customer experiences. To win in 2021 and beyond, companies need to look past the notion of disaster recovery and instead build and plan for true IT resilience.

Let’s look back to November 25, 2020. On the eve of the biggest online shopping weekend of the year, AWS experienced a major outage. It brought down thousands of services across the internet. Apps and services like Roku and Adobe went offline. Home security systems ceased to work. Behind the scenes, the IT teams at AWS and impacted companies scrambled to restore operations, with every additional moment offline degrading hard-won customer loyalty. IT teams cannot afford to be caught off guard by outages like this.

However, despite the broad scope of the outage, not all AWS customers were impacted. Why?

How did companies like Netflix and Apple survive?

Rather than focusing on disaster recovery, forward-looking IT organizations worked backward from the customer experience. During the outage, major AWS customers Apple and Netflix continued their services without interruption. They practiced IT resilience, and ensured that their customers did not experience downtime due to chaos in their data centers.

As more economic activity takes place digitally and customer expectations for user experiences continue to rise, the cost of downtime continues to increase. To thrive in this new environment, companies need to leave “disaster recovery” behind and shift to a posture of IT resilience. Below are three ways to shift the paradigm, with a focus on the backbone of most enterprise IT stacks: the relational database.

Disasters can’t be exceptions so they must become non-events.

By 2022, 75 percent of relational databases will migrate to a cloud environment, according to Gartner. This migration is, in part, driven by a desire to defer to the experts when it comes to running infrastructure. But even in expert-run clouds, disasters still happen. Over the last eight months, AWS, IBM, Facebook, and Google, to name a few, have all experienced outages that resulted in customer-facing downtime. We need to expect failures as the rule, not the exception. Hardware will randomly fail, but your customers' experience needs to remain constant.

Disaster recovery plans – the runbooks accumulating virtual dust from lack of use – are outdated. It's time to evolve for resiliency rather than recovery. A popular emerging pattern for doing this is multi-region scale-out databases rather than scale-up databases. This enables you to distribute risk (and computing power) across multiple machines while the entire cluster behaves like a single logical database. If a machine, data center, or region goes down, you can now preserve customer experience.
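
To make the idea of distributing risk concrete, here is a minimal sketch of one way to check whether a replica layout can lose any single region and keep working. It assumes majority-quorum replication, which many distributed databases use but which is an assumption here, not something prescribed above; the function name and region names are hypothetical.

```python
from collections import Counter


def survives_region_loss(replica_regions):
    """Return True if a majority of replicas remains after losing any one region.

    Assumes majority-quorum replication (an assumption for this sketch):
    the cluster stays available while more than half of its replicas are up.
    """
    total = len(replica_regions)
    quorum = total // 2 + 1
    replicas_per_region = Counter(replica_regions)
    # Losing a region removes every replica placed in it.
    return all(total - count >= quorum for count in replicas_per_region.values())


# Hypothetical layouts: three replicas spread out vs. two of three in one region.
spread_out = ["us-east", "us-west", "eu-west"]
concentrated = ["us-east", "us-east", "us-west"]

print(survives_region_loss(spread_out))
print(survives_region_loss(concentrated))
```

The same number of machines can be resilient or fragile depending on placement: with two of three replicas in one region, losing that region leaves the cluster without a majority.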

Planned downtime and "scheduled maintenance" must be eliminated.

Chaos-related downtime can result in long nights for IT teams, but how should we think about downtime that is scheduled for routine systems maintenance? From the point of view of end-users, this can also be disastrous if a mission-critical service isn’t available when a user needs it most.

Sources of planned downtime include scaling up monolithic relational databases, changing the structure of those databases to support new functionality, database upgrades, and OS security updates.

We need to eliminate “planned downtime” from our vocabulary.

With a global customer base, there is no good time to take your app offline – it will be disruptive to someone somewhere. Companies instead need to be proactive by building architectures that allow for online maintenance, in production. Modern systems should support no-downtime rolling upgrades for maintenance work, and should allow for some application updates on the fly without impacting your end-users.
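
A rolling upgrade can be sketched in a few lines. The example below is an in-memory illustration, not any particular product's procedure: the cluster dictionary, the `min_serving` threshold, and the one-line "upgrade" step are all hypothetical stand-ins. The point it shows is the ordering: one node leaves rotation at a time, and only if enough other nodes are still serving.

```python
def rolling_upgrade(cluster, target_version, min_serving):
    """Upgrade nodes one at a time, never dropping below min_serving nodes.

    cluster maps a node name to its state: {"version": str, "serving": bool}.
    All names and the upgrade step itself are placeholders for illustration.
    """
    for name, node in cluster.items():
        others_serving = sum(
            1 for other, state in cluster.items()
            if other != name and state["serving"]
        )
        if others_serving < min_serving:
            raise RuntimeError(f"upgrading {name} would leave too few serving nodes")

        node["serving"] = False           # drain: stop routing traffic to this node
        node["version"] = target_version  # stand-in for the real upgrade work
        node["serving"] = True            # rejoin only after the upgrade finishes
    return cluster


cluster = {
    "node-a": {"version": "1.0", "serving": True},
    "node-b": {"version": "1.0", "serving": True},
    "node-c": {"version": "1.0", "serving": True},
}
rolling_upgrade(cluster, target_version="1.1", min_serving=2)
print(cluster)
```

In a real system the drain and rejoin steps would wait on health checks; the capacity check before each step is what keeps the maintenance invisible to end-users.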

Stop reporting RPO & RTO. They only tell half the story.

This disaster recovery paradigm shift towards IT resilience will require organizations and IT professionals to rethink the accompanying metrics for survivability. Two common metrics in the disaster recovery space are recovery point objective (RPO) and recovery time objective (RTO) – how much data will be lost in an outage or failure and how long it takes to recover. These metrics are a good starting point, but they only tell half the story. They fall short when it comes to comparing architectures that enable IT resilience.

The missing component is that, depending on the architecture, the odds of an RPO/RTO-triggering disaster can be wildly different. IT teams should understand the chances of an RPO/RTO event occurring, the cost of every second of downtime and data loss, and what that means for their expected cost of disasters over the long term. If companies optimize for expected cost of downtime rather than simply RPO/RTO, they will find that modern architectures can give them dramatically better results by reducing the chance of an outage, even with apparently similar RPO/RTO characteristics.
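
The arithmetic behind "expected cost of downtime" is simple enough to sketch. The model below is one plausible reading, not an established formula: expected annual cost equals the expected number of RPO/RTO events per year multiplied by the cost of each event. Every figure and architecture name is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Architecture:
    name: str
    events_per_year: float  # expected RPO/RTO-triggering disasters per year
    rto_seconds: float      # time to recover from each event
    rpo_seconds: float      # window of data lost in each event


def expected_annual_cost(arch, cost_per_second_down, cost_per_second_data_lost):
    """Expected yearly cost = event frequency x cost of a single event."""
    cost_per_event = (
        arch.rto_seconds * cost_per_second_down
        + arch.rpo_seconds * cost_per_second_data_lost
    )
    return arch.events_per_year * cost_per_event


# Hypothetical figures: identical RPO/RTO, very different odds of an event.
legacy = Architecture("single-region primary/standby", 2.0, 3600, 60)
modern = Architecture("multi-region scale-out", 0.1, 3600, 60)

for arch in (legacy, modern):
    cost = expected_annual_cost(arch, cost_per_second_down=100,
                                cost_per_second_data_lost=500)
    print(f"{arch.name}: {cost:,.0f} per year")
```

Both architectures report the same RPO and RTO, yet in this toy model their expected costs differ by the same factor as their event frequencies, which is exactly the half of the story that RPO and RTO alone leave out.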

The way forward

The time has come for companies to rethink their architectures to work backwards from their ideal customer experience. Hardware will fail, and software updates will happen. The key is in limiting, or even eliminating, the impact on the customer. That is what it means to prioritize IT resilience.