

Does reliability clash with efficiency?

On our site, outage stories always get clicks: it’s just human nature. Those in the industry need to know what’s gone wrong, and how it happened.  

So in the last month, we’ve written about a failure which took down the UK’s Financial Conduct Authority, a power supply glitch which hit Global Switch in London, and a fire-suppression test which took a Romanian bank offline. Before that we heard of a Google Cloud outage, a network card which downed Belgium’s Belnet, and a failure at Delta Air Lines which stranded the company’s passengers for a day.

Delta Air Lines plane (Source: Makaristos/Wikimedia Commons)

Live with it?

Are these failures something we should expect and live with, or is it possible that some of them are happening because the industry is cutting costs, or pushing too hard for some kinds of efficiency? 

In the case of Delta, apparently a power control module failed and caught fire, and some servers didn’t have backup power available. That sounds like under-investment, in an industry which has been cutting costs and working on the edge of bankruptcy for years. 

Even when investment is adequate, there will inevitably be some failures. These are complex systems, effectively a combination of humans and machines, and authorities on risk suggest such systems will fail at some point every 150,000 to 200,000 hours (roughly once every 17 to 23 years of operation).
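That failure-interval figure can be turned into an annual failure probability, if we make the simplifying assumption that times between failures are exponentially distributed with that figure as the mean; this is a back-of-envelope sketch, not a claim about any particular facility:

```python
import math

HOURS_PER_YEAR = 8760

def annual_failure_probability(mtbf_hours: float) -> float:
    """P(at least one failure in a year), assuming exponentially
    distributed times between failures with the given mean (MTBF)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (150_000, 200_000):
    print(f"MTBF {mtbf:,} h -> {annual_failure_probability(mtbf):.1%} per year")
```

On those assumptions, even a facility at the good end of the range carries roughly a 4 to 6 percent chance of a failure in any given year.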

I’ve heard it said that whenever a data center fails, it’s usually the backup power, or some other failsafe device. Of course, that might be because the failsafe systems work so well that they are the only failures you actually see!

Under-investment is not the only way to over-stretch resources. It also happens when the organization makes an efficiency drive, striving to do more with less.

Web-scale providers can make the trade-off. If it’s mis-applied elsewhere, we might be in trouble 

I’ve heard it suggested that sometimes this desire to run more efficiently is actually reducing the resilience of data centers. It uses more resources if you over-chill and have twice as much power available as you need, but it makes your site more reliable. Cut down on the overprovisioning, and everything is operating much more tightly.

Watch for the signs

These trade-offs are made very consciously in the web-scale arena, where millions of identical servers are performing ant-like tasks. A few failures here and there, and jobs can be passed to a different server. Effectively, resilience is built into the IT, not the facility.
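The "resilience in the IT, not the facility" idea amounts to routing work around dead nodes rather than treating a dead node as an outage. A minimal sketch, with an entirely hypothetical fleet state and server names:

```python
# Hypothetical fleet state: one node down, two healthy.
FLEET = {"server-a": False, "server-b": True, "server-c": True}

def run_job(job: str, fleet: dict[str, bool]) -> str:
    """Pass the job to the first healthy server in the fleet.
    An individual failed node is simply routed around; only the
    loss of every node becomes a visible outage."""
    for server, healthy in fleet.items():
        if healthy:
            return f"{job} completed on {server}"
    raise RuntimeError(f"{job} failed on every server")

print(run_job("index-build", FLEET))
```

The trade-off is visible even in this toy: the software must be written so that any job can run on any node, which is exactly the property most enterprise applications lack.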

That’s fine at web-scale, but if that approach were mis-applied in non-web-scale facilities, we might be in for trouble. We have to remember that web-scale infrastructure is designed around lower-reliability hardware, supporting a software architecture and a set of applications that can tolerate those failures.

I’ve heard people suggesting that web-scale benefits can be delivered on smaller-scale and enterprise-level systems. If this gets followed through, I think we should be watchful for signs that over-emphasizing efficiency is cutting reliability.

A version of this article appeared on Green Data Center News

Readers' comments (1)

  • In our cooling systems, the best way we have found to overcome the sometimes antagonistic requirements of reliability and efficiency is to put two independent cooling loops inside one packaged indoor "CRAC" unit.
    The 50kW free cooling loop, with EC fans and pump consuming around 2.5kW, will pull the pPUE down as close to 1.05 as ambient conditions allow.
    However, none of the foregoing components are considered critical, because the autonomous DX loop is always there with its inverter compressors to meet the load. Of course the pPUE will rise to nearer 1.20, but even that is not bad in a failure situation or under design summer external conditions!
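The commenter's figures are consistent with the usual definition of partial PUE for the cooling subsystem, (IT load + cooling power) / IT load. The ~10kW compressor figure below is inferred from the stated 1.20, not something the comment says directly:

```python
def partial_pue(it_load_kw: float, cooling_kw: float) -> float:
    """Partial PUE for the cooling subsystem alone: total power
    (IT load plus cooling overhead) divided by the IT load served."""
    return (it_load_kw + cooling_kw) / it_load_kw

# Free-cooling loop: 50kW of IT load, ~2.5kW for EC fans and pump.
print(partial_pue(50, 2.5))   # -> 1.05

# A pPUE nearer 1.20 on the DX loop implies roughly 10kW of compressor power.
print(partial_pue(50, 10))    # -> 1.2
```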
