Uptime Institute’s Annual outage analysis, published early this year, called attention to the persistent problem of IT service and data center outages. Coupled with our annual survey data on outages, the analysis explains, to a degree, why investments to date have not greatly reduced the outage problem — at least from an end-to-end service view.
Gathering outage data is a challenge: there is no centralized database of outage reports in any country (that we are aware of) and, short of mandatory rules, there probably won’t be. Uptime Institute’s outage analysis relied on reports in the media, which skews the findings, and on survey data, which has its own biases. Other initiatives have similar limitations.
The US government also struggles to get an accurate accounting of data center/IT outages, even in closely watched industries with a public profile. The US Government Accountability Office (GAO) recently issued a report (GAO-19-514) in which it documented 34 IT outages from 2015 through 2017 that affected 11 of the 12 selected (domestic US) airlines included in the report.
The GAO believes that about 85 percent of the outages resulted in some flight delays or cancellations and 14 percent caused a ground stop of several hours or more. By contrast, Uptime Institute identified 10 major outages affecting the airline industry worldwide in the period since January 2016.
The Uptime Institute data is drawn from media reports and other more direct sources. It is not expected to be comprehensive. Many outages are kept as quiet as possible, and the parties involved do their best to downplay the impact. The media-based approach provides insights but probably understates the extent of the outage problem, at least in the global airline industry.
Government data is not complete either. The GAO explicitly notes many circumstances in which information about airline IT outages is unavailable to it and other agencies, except in unusual cases. These circumstances might involve smaller airlines and airports that don’t get attention.
The GAO also notes that delays and cancellations can have multiple causes, which can reduce the number of instances in which an IT outage is blamed. The GAO’s illustration below provides examples of potential IT outage effects.
The report further notes: “No government data were available to identify IT outages or determine how many flights or passengers were affected by such outages. Similarly, the report does not describe the remedies given to passengers or their costs.” We do know, of course, that some airlines — Delta and United are two examples — have faced significant outage-related financial consequences.
Consumer complaints stemming from IT outages accounted for less than one percent of all complaints received by the US Department of Transportation from 2015 through June 2018, according to agency officials. These complaints raised concerns similar to those resulting from more common causes of flight disruption, such as weather. It is likely that these incidents cost airlines more in reputational damage than in direct operational costs.
The GAO does not have a mandate to determine the causes of the outages it identified, so the report describes possible causes only in general terms. These include aging and legacy systems, incompatible systems, complexity, interdependencies, and a transition to third-party and cloud systems. Other issues included hardware failures, software outages or slowdowns, power or telecommunications failures, and network connectivity.
The GAO said, “Representatives from six airlines, an IT expert, and four other aviation industry stakeholders pointed to a variety of factors that could contribute to an outage or magnify the effect of an IT disruption. These factors ranged from underinvestment in IT systems after years of poor airline profitability, increasing requirements on aging systems or systems not designed to work together, and the introduction of new customer-oriented platforms and services.” All of this is hardly breaking news to industry professionals, and many of these issues have been discussed in Uptime Institute meetings and in our 2016 Airline outages FAQ.
The report cites prevention efforts that reflect similarly standard themes, with five airlines moving to hybrid models (spreading workloads and risk, in theory) and two improving connectivity by using multiple telecommunications network providers. Stakeholders interviewed by the GAO mentioned contingency planning, recovery strategies and routine system testing; the use of artificial intelligence (although it is not clear for what functions); and outage drills as means for avoiding and minimizing system disruptions.
In short, the GAO was able to throw some light on a known problem but was not able to generate a complete record of outages in the US airline industry, estimate their direct or indirect costs, explain their severity and impact or pinpoint their causes. As a result, each airline is left on its own to decide whether to investigate outages, identify causes or invest in remedies.
There is little information sharing. Uptime Institute’s Abnormal Incident Reporting System examines causes for data center-specific events, but it is not industry specific and would not capture many network or IT-related events. Although there have been some calls for greater sharing, within industries and beyond, there is little sign that most operators are willing to openly discuss causes and failures, owing to the dangers of further reputational damage, lawsuits and exploitation by competitors.