Avoiding outages is a big concern for any operator or service provider, especially one providing a business-critical service. But when an outage does occur, the business impact can vary from “barely noticeable” to “huge and expensive.” Anticipating and modeling the impact of a service interruption should be a part of incident planning and is key to determining the level of investment that should be made to reduce incidents and their impact.
In recent years, Uptime Institute has been collecting data about outages, including the costs, the consequences, and most notably, the most common causes. One of our findings is that organizations often don’t collect full financial data about the impact of outages, or if they do, it can take months for the full costs to become apparent. Many of the costs are hidden, even if the outcry from managers and customers (even non-paying customers) is most certainly not. But cost is not a proxy for impact: even a relatively short and inexpensive outage at a big, consumer-facing service provider can attract negative national headlines.
Misleading figures
Another clear trend, now that so many applications are distributed and interlinked, is that “outages” can often be partial, affecting users in different ways. This has, in some cases, enabled some major operators to claim very impressive availability figures in spite of poor customer experience. Their argument: just because a service is slow or can’t perform some functions doesn’t mean it is “down.”
To give managers a shorthand way to talk about the impact of a service outage, Uptime Institute developed the Outage Severity Rating (see image). The rating is not scientific; it might be compared to the internationally used Beaufort Scale, which describes how various wind speeds are experienced on land and sea, based on subjective experience.
By applying this scale to widely reported outages from 2016-2018, Uptime Institute tracked 11 “Severe” Category 5 outages and 46 “Serious” Category 4 outages. Of these 11 severe outages, no fewer than 5 occurred at airlines. In each case, multi-million dollar losses occurred, as flights were cancelled and travelers stranded. Compensation was paid, and negative headlines ensued.
Analysis suggests both obvious and less obvious reasons why airlines were hit so hard. The obvious one is that airlines are highly dependent on IT for almost all elements of their operations, and the impact of disruption is immediate and expensive. Less obviously, many airlines have been disrupted by low-cost competition and forced to “do more with less” in the field of IT. This leads to errors and over-thrifty outsourcing, and makes incidents more likely.
If we consider Categories 4 and 5 together, the banking and financial services sector is the most over-represented. For this sector, outage causes varied widely, and in some cases, cost cutting was a factor. More commonly, the real challenge was simply managing complexity and recovering from failures fast enough to reduce the impact.
The full report, “Annual outage analysis: The causes and impacts of publicly recorded IT service and data center outages from 2016-2018,” is available to members of the Uptime Institute Network.