On our site, outage stories always get clicks: it’s just human nature. Those in the industry need to know what’s gone wrong, and how it happened.  

So in the last month, we’ve written about a failure which took down the UK’s Financial Conduct Authority, a power supply glitch which hit Global Switch in London, and a fire-suppression test which took a Romanian bank offline. Before that we heard of a Google Cloud outage, a network card which downed Belgium’s Belnet, and a failure at Delta Airlines which stranded the company’s passengers for a day.

Delta aircraft N647DL – Makaristos / Wikimedia Commons

Live with it?

Are these failures something we should expect and live with, or is it possible that some of them are happening because the industry is cutting costs, or pushing too hard for some kinds of efficiency? 

In Delta's case, a power control module reportedly failed and caught fire, and some servers had no backup power available. That sounds like under-investment, in an industry which has been cutting costs and operating on the edge of bankruptcy for years.

Even when investment is adequate, there will inevitably be some failures. These are complex systems, effectively a combination of humans and machines, and authorities on risk suggest they will fail at some point in every 150,000 to 200,000 hours of operation, which works out to roughly once every 17 to 23 years of continuous running.

I’ve heard it said that whenever a data center fails, it’s usually the backup power, or some other failsafe device. Of course, that might be because the failsafe systems work so well that they are the only failures you actually see!

Besides under-investment, the other way resources get over-stretched is an efficiency drive, when an organization sets out to do more with less.

Web-scale providers can make that trade-off. If it's mis-applied elsewhere, we might be in trouble.

I've heard it suggested that this drive to run more efficiently is sometimes actually reducing the resilience of data centers. It takes more resources to over-chill and keep twice as much power available as you need, but it makes your site more reliable. Cut down on that overprovisioning, and everything runs with far less margin for error.

Watch for the signs

These trade-offs are made very consciously in the web-scale arena, where millions of identical servers are performing ant-like tasks. A few failures here and there, and jobs can be passed to a different server. Effectively, resilience is built into the IT, not the facility.
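To make that concrete, here is a minimal, purely illustrative Python sketch of the idea, using hypothetical names rather than any provider's real scheduler: when the server running a job fails, the job is simply handed to another member of the pool, so an individual hardware failure never becomes an outage.

```python
import random

# All names here are hypothetical; this is not any provider's real scheduler.
SERVER_POOL = [f"server-{i}" for i in range(8)]


def run_on(server: str, job: str) -> bool:
    """Pretend to run 'job' on 'server'; fail randomly to simulate hardware loss."""
    print(f"trying {job} on {server}")
    return random.random() > 0.2  # roughly one attempt in five fails


def run_with_failover(job: str, max_attempts: int = 3) -> str:
    """Hand the job to another server from the pool whenever an attempt fails."""
    for _ in range(max_attempts):
        server = random.choice(SERVER_POOL)
        if run_on(server, job):
            return f"{job} completed on {server}"
        # A failed server is routine: just pick another one and retry.
    raise RuntimeError(f"{job} failed on {max_attempts} servers")


print(run_with_failover("nightly-report"))
```

The point is that the retry logic, not redundant facility power, is what keeps the work flowing.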

That's fine at web-scale, but if that approach were mis-applied in non-web-scale facilities, we might be in for trouble. We have to remember that web-scale facilities are deliberately built on lower-reliability infrastructure, because the software architecture and applications running on them are designed to tolerate it.

I've heard people suggest that web-scale benefits can be delivered on smaller-scale and enterprise-level systems. If that idea is followed through, we should watch for signs that over-emphasizing efficiency is cutting into reliability.

A version of this article appeared on Green Data Center News