Causes of failure

Data centers are failing too often because the root causes of those failures are being kept secret

Data centers are engineered for high reliability, but all too often they go wrong, losing data and losing money for their customers.

The reasons aren’t hard to see. The facilities are complex systems depending on both technology and human activity, and will inevitably fail at some point. In most cases, the underlying root cause is human error, and increasing data center reliability is a matter of eliminating or reducing that wherever possible.

An upper limit of reliability

For systems like these, there is a theoretical upper limit of reliability, at about 200,000 hours of operation. This is because the human factor can be made smaller and smaller, but eventually the hardware itself will fail, whatever is done to improve the systems and procedures.
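As a rough back-of-the-envelope check (the calculation below is illustrative and not from the article), a ceiling of about 200,000 hours between hardware failures works out at roughly 23 years of operation, or around a four percent chance of a failure in any given year:

import math

# Back-of-the-envelope reliability arithmetic (illustrative assumption:
# treat the 200,000-hour ceiling as a mean time between failures, MTBF,
# with failures arriving at a constant rate).
HOURS_PER_YEAR = 8760
mtbf_hours = 200_000

# Probability of at least one failure within a year: 1 - exp(-t / MTBF)
annual_failure_prob = 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"MTBF ~ {mtbf_hours / HOURS_PER_YEAR:.0f} years of operation")
print(f"Chance of at least one failure per year ~ {annual_failure_prob:.1%}")
# Roughly 23 years and a ~4% annual failure chance - and that is before
# human error is layered on top of the hardware.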

Well-established industries such as aviation are close to achieving the maximum possible reliability but, according to Ed Ansett of i3 Solutions Group, data centers fall short of it.

Why is this happening? According to Ansett, it’s down to secrecy. Failures repeat because they aren’t widely understood. Ansett presented his ideas in detail at the DCD Converged SE Asia event in Singapore last month.

Virtually all failures in complex systems are due to errors in the design, testing, maintenance or operation of the facility. Once a failure has occurred, it should be examined to determine the root cause. When the fundamental issues are identified, it is possible to make changes that reduce the chances of the same failure happening again.

In the case of data centers, most root causes are down to human error – whether it is in the design phase, installation, maintenance or operation. Some potential faults are obvious, or at least easy to identify, such as generators failing to start, or leaks of water. But very often failures occur through a combination of two or three faults happening simultaneously, none of which would have caused an outage on its own.
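A hypothetical sketch (the individual fault probabilities below are invented purely for illustration) shows why these multi-fault combinations are rare at any one site, yet keep recurring across the industry as a whole:

# Hypothetical illustration of why combined faults dominate outages.
# The per-incident fault probabilities are invented for the example.
p_generator_fails = 0.01   # a generator fails to start
p_switch_sticks   = 0.005  # a transfer switch sticks
p_water_leak      = 0.002  # a leak reaches the switchgear

# Assume redundancy rides through any single fault, so an outage needs at
# least two of the three to coincide (treated as independent events).
p = [p_generator_fails, p_switch_sticks, p_water_leak]
p_outage = (
    p[0] * p[1] * (1 - p[2])
    + p[0] * p[2] * (1 - p[1])
    + p[1] * p[2] * (1 - p[0])
    + p[0] * p[1] * p[2]
)

print(f"Chance of a multi-fault outage per incident: {p_outage:.4%}")
# Around 0.008% per incident - negligible for one site, but across thousands
# of sites and years of incidents the same rare combinations keep turning up,
# which is why shared root-cause reports matter.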

In aviation, for example, these complex faults are often uncovered because, when a plane crashes, there is a full investigation of the cause of the accident, and the results are published. This is a mandatory requirement. When a data center fails, there is an investigation, but the results are kept secret, as the investigators sign a non-disclosure agreement (NDA).

Time for regulation?

Airlines share their fault reports because they are forced to by law. Aviation is a heavily regulated industry, because when a plane crashes, lives are lost. Data centers are different. There’s no obvious human injury when a data center crashes, and there are no central regulators for the industry. Any failure report would reveal technical and commercial details of a data center’s operation, which its owner would want to keep as trade secrets, hence the NDAs.

When faults in data centers are investigated (see box for some recent scenarios), the analysts come up with the root cause and suggest improvements to prevent the same thing happening again. But each time the fault crops up, few people can learn from it, because the information is restricted.

As a result, when investigators are called on to investigate a mystery fault, they often know instantly what has gone wrong – much to the shock of the data center operator.

At a recent fault investigation, Ansett only needed to know the basic details of the incident to make a guess at the root cause, which turned out to be completely accurate. It was a fault he had seen in several previous failures at data centers in other countries.

It was a failure that could have been predicted and prevented, but only if the results of previous failure investigations had been made public. Unfortunately, the information is not made widely available, which is a tragic waste of knowledge. It leaves operators designing and specifying data centers blindly: “Reliability is much worse than it need be, because we don’t share,” says Ansett.

It’s possible this may change in future, but the change might not be a pretty one. No industry welcomes regulation, and controls are normally forced on a sector in response to an overwhelming need, such as the desire to limit the loss of life in plane crashes.

Impact will increase

Data centers are becoming more central to all parts of society, including hospitals and the Internet of Things, which increasingly micro-manages things such as traffic systems and the utilities delivering power and water.

As data centers are integrated into the life-support systems of society, their failure will become more critical.

“Over the course of the next few years, as society becomes more technology-dependent, it is entirely possible that failures will start to kill people,” Ansett warns. At this point, the pressure will increase for data centers to be regulated.

The only way to avoid this sort of compulsory regulation would be for the industry to regulate itself better and take steps to improve reliability before there are serious consequences of a fault.

There are some precedents within the technology industries. For instance, IT security issues and best practices are already shared voluntarily via the world’s CERTs (computer emergency response teams).

In the data center world, the Uptime Institute certifies data centers for the reliability of their physical plant in its Tier system, and is also looking at the human factors in its M&O Stamp.

But the industry is still groping towards a solution to the tricky problem of how to deal with the root causes of failure.

This article appeared in the October 2015 issue of DatacenterDynamics magazine

Failure modes

Three scenarios shared by Ed Ansett of i3 Solutions Group at the DCD Converged SE Asia event.

  • Diesel maintenance

A 7.2MW data center had four 2.5MW backup diesels, giving it one spare (N+1) in the event of a power failure; the sizing arithmetic is sketched after these scenarios. They were well fueled and maintained, except for a seal on the pneumatic starter, about the size of a 20-cent piece.

When the site lost power from the grid, its diesels started, but the starter on one failed, and then another generator lost power because of the seal, leaving the site without enough capacity. A fuller maintenance task list could have avoided this.

  • Out-of-synch flywheels

A site had two rotary UPS systems (flywheels) providing redundant backup. When a power failure occurred, the system switched to power from the flywheels. However, the frequency of mechanical flywheels can drift, and their output was combined at a static transfer switch. When the two flywheels were out of phase, the transformer coils saturated, and the site lost power.

The possibility of flywheel phase issues should have been designed for.

  • Fire-suppression crashes disks

A data center featured a fire-suppression system that would flood the data hall with inert gas when a fire was detected. Unfortunately, the sudden gas discharge produces a loud sound, and the associated pressure wave is enough to damage hard drives. When a piece of IT kit emitted smoke, the fire-suppression system kicked in and prevented a fire but killed the hard drives.

To avoid this, the nozzles should be baffled or placed away from the storage racks.
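For reference, the N+1 arithmetic in the diesel scenario can be checked with a short sketch; the load and generator ratings come from the scenario above, while the helper function itself is only an illustrative assumption:

import math

def generators_needed(load_mw: float, unit_mw: float, spares: int = 1) -> int:
    """Smallest number of units that carries the load, plus spare units."""
    return math.ceil(load_mw / unit_mw) + spares

load_mw = 7.2   # critical load from the diesel maintenance scenario
unit_mw = 2.5   # rating of each backup diesel

print(f"N+1 requires {generators_needed(load_mw, unit_mw)} generators")
# ceil(7.2 / 2.5) + 1 = 4, matching the four units installed on the site.
# The catch: N+1 only covers one failed unit. With two generators down,
# the remaining 2 x 2.5 MW = 5.0 MW could no longer carry the 7.2 MW load.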

Readers' comments (8)

  • Ed Ansett gave a very good presentation and made a good case for this need at DCD-Singapore last month. Others echoed this need at the event too.

  • Most data center power failures occur from a lack of coordination between the circuit breakers within the distribution network. For instance, a millisecond difference in the delay settings of the upstream breakers can make a big difference. Hence, a detailed breaker discrimination study should be mandated for better operational reliability and safety.

  • I don't think that any form of federal regulation is ever the answer. Can we as an industry do a better job at self-regulation, however? Yes, I certainly believe we can.

  • In my scope I have to deal with a mix of owned and multiple colocation DCs, and I find differences in resilience testing: most sites now carry out a black-building test monthly, but a few select the yearly option. We now know outages are usually due to a number of undetected faults, so testing monthly must be the way forward to detect the minor faults before they become multiple faults and cause outages. It's also during testing that the engineers gain the knowledge and experience, so when something does go wrong they know what they are doing, limiting the human error. Monthly testing gives customers the comfort of knowing the site is tried and tested often; when it's yearly it's normally a big thing and everybody is aware of it and the risk.

  • Spot-on article, Peter. There are rafts of failure stories to complement this article on the Availability Digest under 'Never Again'. It is an eye-opener for those who think 99.99% availability is the norm. My favourite is the London Stock Exchange on March 1st, 2000, when the update jobs ran in the wrong order. No reason was given, but my suspicion is that, of the two systems involved, one recognised the leap year and the other didn't: a classic Y2K problem.

  • Your points about the human fat finger being the primary cause of failure and the lack of transparency in these incidents are absolutely correct. Terry Critchley, in his excellent book on high availability entitled "High Availability IT Services," points out that service reliability depends upon people, products, and processes, with the emphasis on people. The LinkedIn Continuous Availability Forum has a pertinent thread entitled "Why Don't Companies Share Information on Outages?"

  • The FCC's Network Outage Reporting System (NORS) used to be publicly accessible. The Nuclear Regulatory Commission has its Event Reporting system available for the public to browse.

    I'm not confident any industry can regulate itself with respect to information sharing: doing so is embarrassing, risky (someone might sue you), and not good business sense (you're helping your competitors).

  • I was reading the article by chance. Our company sees many avoidable problems and shutdowns due to poor maintenance routines and a lack of knowledge of products on the market that can help maintain continuity. Everything will perform better and for longer if you take care of it - humans included!
