Through the back half of 2023, we observed an increasing number of data center outages caused by plant failures. The growing prominence of plant failures as a root cause is unusual, to say the least.

Data center design has been refined and optimized considerably over the past decade. Facilities have grown in size to cater to hyperscalers, in complexity to act as interconnection points to a wide array of cloud services and fiber operators, and in number to cater to increased demand for technical floor space.

The plant used to run data centers is often deployed with 2N or greater levels of redundancy to ensure availability and uptime metrics are met. While there have been instances of gensets failing to start up and take production load when called upon, these cases have been declining, as operators have performed more thorough and regular testing of their redundant systems.

In that context, it may come as somewhat of a surprise to see plant failures increasingly cited as the cause of data center outages that, in turn, take down cloud services and applications that rely on servers hosted in these facilities.

But there are several plausible explanations for the increased occurrence of these plant failures.

Outside conditions

Climate is an obvious explanation: power outages at facilities over the past year have often coincided with extreme weather events such as heatwaves and storms.

During extreme heat, there is pressure on power grids generally; for data center operators, that can translate to power quality fluctuations such as surges and brownouts, and a potential need to generate some of their own power using an onsite plant to smooth supply and continue servicing IT equipment. The chiller plant also has to work harder to keep data floor temperatures within a specified safe range.

Storms, on the other hand, pose a different set of issues. Lightning strikes can knock out an onsite substation and one or more mains power feeds. There have also been cases where hail or heavy rains have led to water ingress into technical floor space, damaging equipment and shorting the power distribution plant in the affected area.

High-powered demands

While weather conditions explain some data center outages, others appear to be the result of a different phenomenon: the rise of compute-intensive, data-driven workloads being processed at these sites. For older sites, these workloads are pushing rack densities far beyond existing specifications, leading to a rise in the number of such facilities undergoing chiller and other plant upgrade and replacement projects.

Some operators are responding by moving intensive workloads into smaller, purpose-built, high-density sites. Previously, in larger colocation facilities, intensive workloads would have run in designated rooms or data halls that catered to higher rack densities. Typical rack densities traditionally max out at about 7kW, with high-density zones catering for racks of up to 50kW. But in the current data-driven environment, racks no longer top out at 50kW: some are moving toward extreme densities upwards of 200kW per rack.
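
To put those density figures in perspective, the short sketch below uses illustrative numbers (the rack count and the per-rack draws are assumptions, not figures from any specific facility) to show how quickly the heat load on a hall's cooling plant scales as densities climb; virtually all power drawn by IT equipment ends up as heat the chillers must remove.

    # Illustrative only: rack count and per-rack power draws are assumed values.
    RACKS_PER_HALL = 20  # assumed size of a single data hall

    for label, kw_per_rack in [("typical", 7), ("high-density", 50), ("extreme", 200)]:
        total_kw = RACKS_PER_HALL * kw_per_rack
        # Total electrical draw is a rough proxy for the heat the cooling plant must reject.
        print(f"{label:>12}: {kw_per_rack:>3} kW/rack x {RACKS_PER_HALL} racks = {total_kw:,} kW of heat to reject")

On those assumed numbers, even a small hall at extreme densities implies a power and cooling load in the megawatt range, which is well beyond what legacy plant was sized to serve.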

It’s clearly inadvisable to ask older or more general colocation facilities to support these kinds of intensive compute workloads. It makes better sense to host them at purpose-built facilities that are designed to do one thing well: support extreme compute needs, with the technical floor space and plant to match.

However, concentrating intensive workloads into a small footprint is also no guarantee of uptime. Having intensive workloads running side-by-side places more pressure on facility operators to ensure uptime by keeping the plant operating. Such environments contain a density of equipment that is more sensitive to slight changes in power availability or cooling capacity, and any failure could degrade or damage compute capacity powering data-driven decisions for some of the world’s critical infrastructure.

Coding with care

Another possible explanation for rising data center failures is the abstraction of infrastructure away from its consumers. Architectural decisions for applications are made in isolation, without necessarily a good understanding of the underlying infrastructure requirements.

That’s because the rise of platform-as-a-service (PaaS) and serverless architectures means developers can focus on creating code; they don’t necessarily need to understand the ins and outs of the underlying infrastructure, including how to limit the intensity of processing their application code requires to function.

In addition, applications now often rely on third parties to complete functions via APIs. This offloads more processing demand onto other parties and relies on those parties efficiently using underlying infrastructure as well. Inefficient code means sub-optimal infrastructure use. Multiply that by the number of applications that call a particular data center home, and it’s clear that this may be putting undue stress on data center plant to meet heightened processing demands.
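
As a minimal sketch of the kind of difference an application-level choice can make, the example below contrasts calling a third-party API once per record with batching records into a single call. The endpoint, batch path, and functions are hypothetical, used purely to illustrate how inefficient code multiplies requests, and therefore processing load, on whoever's infrastructure serves them.

    import requests  # assumes the 'requests' package is installed

    API_URL = "https://api.example-enrichment.com/v1/enrich"  # hypothetical third-party endpoint

    def enrich_one_by_one(records):
        # One HTTP request, and one unit of backend processing, per record.
        return [requests.post(API_URL, json=rec).json() for rec in records]

    def enrich_in_batches(records, batch_size=500):
        # Same work expressed as far fewer requests, assuming the provider offers
        # a batch endpoint; connection overhead and per-call processing on the
        # provider's infrastructure drop accordingly.
        results = []
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            results.extend(requests.post(API_URL + "/batch", json=batch).json())
        return results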

Improving line of sight

In today’s environment, avoiding being caught off guard requires the capability to detect degradation at the data center sites a cloud service or application depends on. This matters not only for immediate uptime, but also for improving the service or application's design by reducing reliance on any single data center.
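
One lightweight way to build that line of sight, sketched below using only the Python standard library and an assumed health-check URL and threshold, is to probe a service from outside the facility at regular intervals and flag when response times drift well beyond their recent baseline.

    import time
    import statistics
    import urllib.request

    ENDPOINT = "https://app.example.com/health"  # assumed health-check URL
    WINDOW = 30          # number of recent samples that form the baseline
    DEGRADE_FACTOR = 3   # assumed threshold: flag at 3x the baseline median

    samples = []

    while True:  # simple long-running probe loop
        start = time.monotonic()
        try:
            urllib.request.urlopen(ENDPOINT, timeout=10)
            latency = time.monotonic() - start
        except Exception:
            latency = float("inf")  # treat outright failures as unbounded latency

        if len(samples) >= WINDOW:
            baseline = statistics.median(samples[-WINDOW:])
            if latency > DEGRADE_FACTOR * baseline:
                print(f"degradation suspected: {latency:.2f}s vs baseline {baseline:.2f}s")

        if latency != float("inf"):
            samples.append(latency)
        time.sleep(60)  # probe once a minute

Running the same probe from several vantage points, and against services hosted in different facilities, is what allows an operator to distinguish a localized data center issue from a wider problem.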

To ensure a seamless user experience, operators of cloud services and web-based applications need to be able to understand everything underpinning them. That likely includes extra consideration of the underlying infrastructure: its physical (data center) location, and the capabilities of that data center in terms of its design and redundant plant.