We have spent years analyzing and classifying the causes of downtime in data centers, but the subject is now a burning issue.
The recent case of British Airways, with 75,000 passengers affected and more than £100 million in direct losses, joins a number of other events in recent months that paint a surprising picture. In some sectors, state-of-the-art technology is exposed to obvious risks by the more “traditional” part of the installation: the electromechanical component that supports the operation of IT equipment and systems.
Less than nine months earlier, another airline, Delta Air Lines, suffered an outage as serious as the one at BA, which, coupled with recent and repeated non-technical incidents at United Airlines, has put the whole sector in the spotlight.
It’s not just planes
But aviation data centers are not unique; other sectors suffer the same problems. In recent months we have seen plenty of outages. In February, Amazon Web Services (AWS, US East), Telstra (Australia) and Nielsen (Oldsmar, Florida) went down, and in March Microsoft (US East) and Microsoft Azure (Osaka and global) suffered outages.
The latest report published by Ponemon dates from the beginning of 2016 and records 63 cases of downtime in the USA alone in a single year.
The causes of downtime are diverse and vary depending on the source consulted. According to Ponemon, one in four failures is attributed to UPSs (and their associated batteries), 22 percent to cyberattacks (a figure that has grown exponentially since Ponemon began this type of analysis), 22 percent to human error, 11 percent to mechanical issues and 6 percent to the emergency generator plant.
PQC’s own statistics show an important trend in recent months: the generator is increasingly the ultimate culprit. In most of the events discussed above, the generator appears, almost inevitably, at the end of the chain of errors and bears final responsibility for the loss of service continuity.
Apart from cyberattacks, it seems that 80 percent or more of incidents have electrical causes at the root of the problem. Within that, depending on the type of data center, there are a number of differences worthy of mention.
Old redundancy designs
In the Ponemon report, 25 percent of failures are attributed to the UPS, which suggests that the sample from which the data was obtained has a rather poor design topology. In data centers where the UPSs are configured with distributed redundancy, it is frankly difficult (though not impossible) for a UPS failure to cause a total outage. It is therefore very likely that a majority of the cases included in this analysis use the older parallel redundant topologies. This design was the state of the art in the 1990s and still supports many data centers. In our experience, reconfiguring parallel redundant topologies into distributed redundant ones has been a constant task for quite a few years.
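To make the difference concrete, here is a minimal sketch of the two topologies (an illustration of my own, not drawn from the Ponemon data or any particular installation; the module names, module counts and failure scenarios are hypothetical). In an N+1 parallel redundant design, every UPS module feeds a single shared output bus, so that bus is a single point of failure; in a distributed redundant design, a dual-corded load is fed from independent UPS systems and survives the loss of any one of them.

```python
# Illustrative comparison of UPS topologies (hypothetical names and counts).

def parallel_redundant_survives(failed_modules, bus_fault):
    """N+1 modules sharing one output bus: the load survives only if the
    common bus is healthy and enough modules remain to carry the load."""
    modules = {"UPS-A", "UPS-B", "UPS-C"}   # three modules, two needed (N+1)
    healthy = modules - failed_modules
    return (not bus_fault) and len(healthy) >= 2

def distributed_redundant_survives(failed_systems):
    """Dual-corded load fed from two independent UPS systems, each with its
    own bus: the load survives as long as one complete path is healthy."""
    systems = {"System-1", "System-2"}
    return len(systems - failed_systems) >= 1

# A single UPS failure: both topologies ride through it.
print(parallel_redundant_survives({"UPS-A"}, bus_fault=False))    # True
print(distributed_redundant_survives({"System-1"}))               # True

# A fault on the shared output bus drops everything in the parallel
# redundant design, because every module feeds that one bus.
print(parallel_redundant_survives(set(), bus_fault=True))         # False

# The distributed design has no shared bus; a total outage requires both
# independent systems to fail at the same time.
print(distributed_redundant_survives({"System-1", "System-2"}))   # False
```

The point is the one made above: the shared bus is what can turn a single failure into a total drop, which is why the older parallel redundant sites tend to dominate the UPS-related statistics.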
Without detracting from the rigor and precision of the Ponemon analysis, the reality of the cases mentioned, and of this author's own experience, is quite different. This personal observation starts from two events that took place five and six years ago respectively. Both involved Amazon, both were different from the cases described above, and in both the backup system was the cause of the failure.
I remember that the explanation for the first one was something like: “The source of the problem is the lightning strike that hit a transformer near its data center, causing an explosion and a fire. The service stopped working and it was not possible to start the generators… As the power company was unable to provide services, this meant that generators could not be started.”
And from the second: “It was a rainy Monday in August. A 10-million-watt transformer exploded in northern Virginia, sending a huge voltage spike through the power grid that reached an Amazon data center in Ashburn, Virginia, leaving no power to the data center.”
Well, it seems that in both cases, even though the initial explanations drifted towards events bordering on the paranormal, the final responsibility lay with the backup system. On one occasion it did not start; on the other, an incorrectly adjusted switch created an overload. That is, something like “more of the same”.
But if we look at last year’s cases, Delta, Microsoft, Nielsen and, most likely, British Airways complete a long list of victims of the system of last resort - the safety net that should protect the acrobat and which really must not contain holes.
You must test!
In the same vein, testing is not a widespread culture. We have known many cases where testing was even forbidden, given the results it usually produced, and where, at best, obtaining authorization was a rare achievement. In such cases, the moment when the safety systems have to prove their worth coincides with the moment they are really needed, and an error triggers a complete outage.
The recommendation must be that those with the power to make these decisions should not only authorize but actively encourage tests under real operating conditions, at times when the primary system is perfectly operational. Only then will we have real guarantees that the enormous investment in backup equipment, systems and redundancy is fully justified.
The causes of downtime in the data centers have been classified into four large groups:
- those that have their origin in an incorrect design,
- those related to the execution,
- those that have to do with the operation and maintenance of the installation and
- those directly associated with a failure in the product.
Regardless of the passage of time, the percentages hold steady, especially in the case of operation and maintenance, where human error is most likely to occur.
In general, to produce a catastrophic failure, an installation usually has to suffer from two things: the lack of a suitable topology and faulty operation. Of the two, the second is the more prevalent, with human error at the bottom of it. Even the most sophisticated topologies fall. In fact, although a Tier IV site aims for total availability, it is common to witness service losses in environments of this or a similar level, and the results of the subsequent analysis do not usually leave much room for doubt.
There is one more circumstance that we raise repeatedly: the cycle of failure, the process by which the conditions that produce downtime will, unfailingly, occur again with a certain periodicity, which we estimate at between eight and 13 years.
Above all, there is a narcissistic feeling of having everything under control, which makes us think only about how well we are doing, and so makes us deviate from the first rule for anyone in charge of a data center: always keep your guard up, always maintain a reasonable doubt that everything is in order, and always make sure it is tested thoroughly and regularly.
Garcerán Rojas is chairman at PQC, a Spanish data center engineering firm