The overall frequency and severity of data center outages are on the decline, according to a new report from the Uptime Institute.
There are, on average, 10 to 20 high-profile IT outages per year that cause serious financial loss, business and customer disruption, reputational loss, or, in extreme cases, loss of life, according to Uptime.
IT capacity up, downtime down
While the absolute number of downtime incidents is higher than in previous years, it is growing more slowly than IT capacity is expanding, so outages are falling relative to the amount of IT in operation.
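The arithmetic behind that claim can be illustrated with a minimal sketch. The figures and the choice of megawatts as the capacity unit are hypothetical, picked only to show the effect, and do not come from the Uptime report: if outages rise by 10 percent while capacity grows by 25 percent, the outage rate per unit of capacity falls by about 12 percent.

```python
def outage_rate(outages: int, capacity_mw: float) -> float:
    """Return outages per megawatt of IT capacity."""
    return outages / capacity_mw


# Hypothetical figures for illustration only, not taken from the Uptime report.
prev = outage_rate(outages=100, capacity_mw=1000.0)  # earlier year
curr = outage_rate(outages=110, capacity_mw=1250.0)  # later year: more outages, much more capacity

# Relative change in the rate; negative means fewer outages per unit of capacity.
relative_change = (curr - prev) / prev

print(f"previous rate: {prev:.3f} outages per MW")
print(f"current rate:  {curr:.3f} outages per MW")
print(f"relative change: {relative_change:.1%}")
```

The absolute count goes up in this example, yet the rate goes down, which is the distinction the report draws.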
Fifty-five percent of operator respondents to the 2023 Uptime Institute data center survey reported having an outage in the past three years. This is a decline from 60 percent in 2022 and 69 percent in 2021.
Of these outages, only one in 10 was categorized as serious or severe in 2023. Operators told Uptime that 41 percent of outages in the past three years were negligible. This is an improvement of four percentage points from 2022 and 10 percentage points from 2021.
More than half (54 percent) of the respondents in the survey said severe outages cost more than $100,000, with 16 percent claiming that their most recent outage cost more than $1 million.
Cloud, Covid-19, and curbing complacency contribute to decrease
The report says decreased tolerance for complacency across sectors has contributed to the general decrease in outage frequency. High reputational costs resulting from outages have encouraged industry stakeholders to prioritize resiliency.
Uptime adds that organizations are investing in infrastructure redundancy, with enterprise, colocation, and cloud data centers all moving towards software-based resilience models. Previous expectations suggested that multi-site approaches would undermine physical site redundancy strategies.
Movement to the public cloud has not necessarily resulted in fewer outages. Instead, it has meant third-party suppliers are registered as the cause of IT disruptions, reducing the overall number of on-premises outages.
The Covid-19 pandemic caused demand to oscillate, which in turn strained supply chains and distorted outage rates. The report says supply chain disruptions stall capital projects and delay infrastructure upgrades. This has temporarily reduced the rate of incidents that often cause outages.
The use of distributed software-based resiliency, which can reduce outages over time, also has the potential to add new risks.
Power disruptions - the leading cause of outages
According to Uptime’s survey, 52 percent of respondents named power as the primary cause of recent impactful outages.
Over eight years, third-party operators, telecommunications, and cloud and Internet providers account for 67 percent of outages overall. These operators have seen a marginal, but constant, increase since 2020, rising by five percentage points to account for nearly one in 10 outages in 2023.
This reflects the growing reliance on cloud hosting, SaaS, and colocation providers.
Telecommunications has experienced an increase in outages because of rising demand for connectivity and capacity across sectors. The criticality of mobile networks has meant outages can have an outsized impact.
Financial sector outages declined considerably in 2022 and 2023, potentially because of stricter regulations and oversight following a series of large, high-impact outages before 2021.
Four in five respondents say their most recent serious outage could have been prevented with better management, processes, and configuration.
Human error contributes to a significant majority of all downtime incidents
Across 25 years, Uptime estimates that human error, directly or indirectly, accounts for between two-thirds and four-fifths of all downtime incidents.
The most common cause of major human error-related outages is data center staff failing to follow procedures or processes (48 percent). This is followed by incorrect staff processes (45 percent) and installation issues (23 percent).