IT outages are inevitable: here’s how to manage your next one effectively

In the last few months, we’ve seen some major IT failures: a daylong Wells Fargo outage that prevented customers from accessing their accounts, an Amtrak failure that left 60,000 Chicago passengers stranded, and a global outage of Gmail and Google Docs that prevented people from using those products.

And then there was the VFEmail.net hack in February, which resulted in complete loss of all client data – including backups.

Make a plan

These and similar IT problems offer us two important takeaways:

IT outages can happen to anyone (and will eventually happen to everyone).
The extent of damage your next IT outage causes depends on how well you prepare for it right now.

It’s also important to note that over 60 percent of IT outages or “disaster events” are caused by human error. So how can you minimize the damage that your next IT outage causes to your revenue, reputation, and customers?

First, make sure you have a business continuity plan (BCP) that includes both a disaster recovery plan (which outlines how you’ll handle your IT) and a plan for keeping the rest of the business going (e.g., communicating if key channels are down, making sure key people know what’s going on, establishing a meeting place, defining a chain of command, etc.).

Here, I’ll outline four crucial steps for being effective on the IT side.

Define potential disaster scenarios

For most companies, there are two major IT disaster scenarios:

System outage, in which some key part of your network or application malfunctions and you or your services are “offline” for a period of time. This is usually the easier scenario to recover from: once you are back online, relatively few transactions have been affected by the downtime.
Data loss, in which you lose information, content, or data (either your own or your clients’). It’s not always possible to recover from a data loss, as in the VFEmail.net hack, in which all copies of backups were deleted.

The first step to ensuring you’re ready for a disaster is understanding your risk profile for these common types of outages. What capabilities will be affected by a system outage? How crucial are those capabilities to running your business? Will an outage cause data loss? What other events might trigger data loss?

And again, remember that human error will be the most prevalent cause of both types of disasters (as in the Amtrak incident, when a worker fell on a circuit board during a server update).

Assess the potential damage to your business

This is a job for IT and other leaders to do together. The goal is to understand how your business, as a whole, will be affected if its individual pieces are down or if various types of data are lost.

In these conversations, aim to understand dependencies among business-critical apps (e.g., you know you need the payment processing app to be live, but does it depend on the inventory app to function?), clarify the effect on users that outages will have, and assess the financial impact of each minute of downtime for your business.
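One way to make these conversations concrete is to write the dependencies down and let a script work out what else goes offline when one piece fails. The following is a minimal sketch, not a prescribed method: the app names, the dependency map, and the per-minute cost figures are all hypothetical placeholders that you would replace with your own numbers.

```python
from collections import deque

# Hypothetical dependency map: each app lists the apps it needs in order to work.
DEPENDS_ON = {
    "payments": ["inventory", "auth"],
    "inventory": ["database"],
    "auth": ["database"],
    "reporting": ["database"],
    "database": [],
}

# Hypothetical revenue lost per minute (USD) while each app is unavailable.
COST_PER_MINUTE = {
    "payments": 500.0,
    "inventory": 50.0,
    "auth": 0.0,
    "reporting": 5.0,
    "database": 0.0,
}


def affected_apps(failed, depends_on):
    """Return every app that stops working when `failed` goes down, itself included."""
    # Invert the map so we can walk from an app to the apps that rely on it.
    dependents = {app: [] for app in depends_on}
    for app, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(app)

    # Breadth-first walk over everything downstream of the failed app.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for app in dependents.get(current, []):
            if app not in seen:
                seen.add(app)
                queue.append(app)
    return seen


def downtime_cost(failed, minutes, depends_on, cost_per_minute):
    """Estimate the direct cost of `minutes` of downtime caused by `failed`."""
    apps = affected_apps(failed, depends_on)
    return minutes * sum(cost_per_minute.get(app, 0.0) for app in apps)


if __name__ == "__main__":
    print(sorted(affected_apps("inventory", DEPENDS_ON)))
    print(downtime_cost("inventory", 30, DEPENDS_ON, COST_PER_MINUTE))
```

With these placeholder figures, a 30-minute inventory outage also takes payments down, so the estimate reflects both apps rather than inventory alone. Reputational damage is deliberately left out because it is much harder to put a number on.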

Useful benchmarks here:

RTO (recovery time objective), which defines how long your business can stay offline without suffering serious damage. Your DR plan should outline a strategy to restore business operations within the RTO you define.
RPO (recovery point objective), which defines how much recent data, measured in time, you can afford to lose, and therefore how long you can go between data backups without significantly hurting your business operations. Your business interruption analysis will define your RPO. (So if your DR plan calls for restoring data from the last known backup, the RPO defines how old it’s acceptable for that backup to be.)
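To show how the two benchmarks are used in practice, here is a small sketch that checks a backup timestamp against an RPO and an estimated recovery time against an RTO. The function names, the four-hour RPO, and the timestamps are illustrative assumptions, not values recommended for any particular business.

```python
from datetime import datetime, timedelta


def data_at_risk(last_backup, now):
    """Amount of recent work that would be lost if you restored from the last backup now."""
    return now - last_backup


def meets_rpo(last_backup, now, rpo):
    """True if the most recent backup is no older than the RPO allows."""
    return data_at_risk(last_backup, now) <= rpo


def meets_rto(estimated_recovery, rto):
    """True if the plan is expected to restore operations within the RTO."""
    return estimated_recovery <= rto


if __name__ == "__main__":
    # Assumed targets for illustration only.
    rpo = timedelta(hours=4)
    rto = timedelta(hours=8)

    last_backup = datetime(2019, 3, 1, 2, 0)
    now = datetime(2019, 3, 1, 9, 30)

    print("Data at risk:", data_at_risk(last_backup, now))
    print("Meets RPO:", meets_rpo(last_backup, now, rpo))
    print("Meets RTO:", meets_rto(timedelta(hours=6), rto))
```

In this example the last backup is seven and a half hours old, so a four-hour RPO is not met even though the six-hour recovery estimate fits inside the eight-hour RTO. The two targets are independent, and a plan has to satisfy both.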

If you want to take one additional step, be sure the assessment includes an estimate of the damage to your reputation that downtime might cause. This is much harder to calculate, but it can be a valuable component in the decision-making process.

The goal of these exercises is to illuminate what kind of downtime you can afford as a company so that you can build a DR plan that fits.

Review your current disaster recovery plan

Once you know what kind of downtime your business can reasonably afford, take a look at your current DR plan. If you’re like most businesses, you have one but haven’t been diligent about updating it or testing it regularly. Now’s the time to change that.

As you review your DR plan, consider the following:

Does it reflect the realities of your business today, including plans for business-critical apps as articulated in your earlier conversations? If not, hop down to the next section, because you’ll need to update it.
Is it right-sized? IT teams are excellent at coming up with creative ways to do DR. This is in part because these systems are their babies and they’re very attuned to all the ways things can go wrong. But elaborate DR is often more than a company needs – and more expensive than the company can afford. If you’ve determined that you can afford three days of downtime and your current DR plan has you back online in six hours, it’s time to make some changes. Again, refer to RTO and RPO here.
Have you tested it? I get it. Many DR plans are developed to check a box or meet a regulatory requirement. But if you don’t test your plan, it’s worthless to you in a real disaster. You have no way of knowing whether it will actually prevent the kind of revenue loss and reputational damage that unexpected outages and data loss can cause.
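The right-sizing question above can be reduced to a simple comparison between the recovery time your plan actually delivers and the RTO the business has agreed to. The sketch below is one possible rule of thumb: the `slack_ratio` threshold of 25 percent is an assumption made for illustration, and the labels are hypothetical, so adjust both to your own tolerance.

```python
from datetime import timedelta


def assess_plan_size(recovery_time, rto, slack_ratio=0.25):
    """Classify a DR plan by comparing its tested recovery time with the RTO.

    slack_ratio is an assumed cut-off: a plan that recovers in less than this
    fraction of the RTO is flagged as a candidate for cost savings.
    """
    if recovery_time > rto:
        return "under-protected"
    if recovery_time < rto * slack_ratio:
        return "possibly over-engineered"
    return "right-sized"


if __name__ == "__main__":
    # The example from the text: three days of affordable downtime,
    # but a plan that restores service in six hours.
    print(assess_plan_size(timedelta(hours=6), timedelta(days=3)))
```

A "possibly over-engineered" result is a prompt for a cost conversation, not an automatic instruction to cut back. The recovery time fed into the check should come from an actual test, not from the plan document.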

Update and test your DR plan

I work with a lot of businesses. Most of them don’t regularly update and test their DR plans. It’s a nice-to-have project in a world of must-have projects. That’s a big problem because an outdated DR plan is more or less worthless in the event of a real disaster.

Take these steps as you make changes:

Assign someone to be in charge of DR and testing. This means someone will be accountable if it goes wrong, which significantly increases the chances that testing gets done.
Make sure the C-suite is aligned on the importance of having a DR plan and conducting regular stress tests. This is crucial to getting the participation you’ll need from non-IT colleagues.
Include a definition of “disaster.” Know when and how you’ll launch your DR plan – after an hour of downtime? A day? Define, too, who makes this call and who makes the call if that person is out.
Put disaster-prevention rules in place. The Amtrak disaster I cited earlier happened in part because the company did a server update during peak usage hours. That is an incredibly preventable error: if the worker had fallen on the circuit board in the middle of the night, very few travelers would have been affected and the story may not have made the news.
Include a communication plan. Being transparent with stakeholders during a disaster (“here’s what’s happening”) and after (“here’s what happened and what we’re doing to improve performance in the future”) will go a long way toward mitigating any reputational damage a disaster may cause.
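Several of the steps above, such as the definition of “disaster,” the chain of command, and the rule against risky changes during peak hours, can be written down as explicit policy that tooling can check. The following is a rough sketch under stated assumptions: the one-hour threshold, the 06:00 to 22:00 peak window, and the role names are all hypothetical and stand in for whatever your own plan defines.

```python
from datetime import datetime, time, timedelta

# Assumed policy values for illustration only.
DISASTER_THRESHOLD = timedelta(hours=1)   # downtime after which the DR plan is launched
PEAK_START = time(6, 0)                   # start of peak usage hours
PEAK_END = time(22, 0)                    # end of peak usage hours
DECISION_CHAIN = ["cio", "it_director", "ops_manager"]  # hypothetical roles, in order


def should_declare_disaster(outage_start, now, threshold=DISASTER_THRESHOLD):
    """True once the outage has lasted at least as long as the agreed threshold."""
    return now - outage_start >= threshold


def decision_maker(available, chain=DECISION_CHAIN):
    """Return the first person in the chain who is reachable, or None if nobody is."""
    for role in chain:
        if role in available:
            return role
    return None


def change_allowed(when, peak_start=PEAK_START, peak_end=PEAK_END):
    """Risky changes, such as server updates, are only allowed outside peak hours."""
    return not (peak_start <= when.time() < peak_end)


if __name__ == "__main__":
    outage_start = datetime(2019, 3, 1, 9, 0)
    now = datetime(2019, 3, 1, 10, 15)
    print(should_declare_disaster(outage_start, now))
    print(decision_maker({"it_director", "ops_manager"}))
    print(change_allowed(datetime(2019, 3, 1, 9, 0)))
    print(change_allowed(datetime(2019, 3, 1, 2, 0)))
```

The point of encoding the rules is less the code itself than the fact that it forces the thresholds and the backup decision-maker to be agreed in advance, before anyone has to make the call under pressure.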

Effective DR is all about details

While it’s true that every business should have and test a DR plan, it’s also true that no two businesses are alike in what they need or how they should respond to disasters. For any business, DR should be based on two things: its risk profile and its ability to recover from an event, large or small.

To make sure your next IT outage causes as little damage as possible to your customers, your revenue, and your reputation, spend time understanding the specifics of what can go wrong and how those problems will affect your customers – and build a DR plan to minimize that impact.
