When it comes to data centers, ‘resilience’ is best defined as ‘the ability to maintain ICT service in the face of environmental extremes, human error or deliberate sabotage’ and, generally, higher levels of resilience can be engineered into the mechanical and electrical infrastructure at a cost premium.

However, ‘human error’ is well documented as the root cause of at least 70 percent of all data center ‘failures’, yet even that can be reduced by design: for example, a dual-bus power system with a UPS in each bus can largely protect a correctly connected dual-corded load against power failure, human error and inept sabotage - but you probably notice how careful I am with the caveats…

Numbers can mislead

Of course, if you are a client/user of a data center you clearly want to know what you are getting for your money, not least so that you get what you pay for. In the wonderful words of John Ruskin (1819-1900): ‘There is nothing in the world that some man cannot make a little worse and sell a little cheaper, and he who considers price only is that man’s lawful prey’. In modern parlance: if you pay the lowest price, you are usually buying rubbish.

So, how to differentiate between systems? Well, we have two ‘metrics’, somewhat interlinked and both abused:

  • The ‘Tiers’ of the Uptime Institute (I-IV), the ‘Types’ of TIA-942 (I-IV), the ‘Rating’ of BICSI (0-4, although ‘0’ doesn’t describe a data center, so effectively 1-4) and the ‘Availability Class’ of EN50600
  • Availability percentage, e.g. 99.999 percent (the so called ‘five-nines’)

Apart from pointing out that only the Uptime Institute can give you a Tier rating, that TIA-942 and BICSI are ANSI standards most applicable in North America, and that EN50600 isn’t yet used much, we can distil them all into four levels describing the capability of ‘concurrent maintainability’ and ‘fault tolerance’. The principles are clear. Concurrent maintainability answers the question: what is the point of building a hugely reliable (and maybe resilient) data center that must be shut down once a year for maintenance? A fault tolerant system, meanwhile, can have any component, path or space ‘fail’ (one at a time) without impacting the ICT service.

But the greatest abuse is reserved for the Availability percentage: easy to calculate but capable of huge misinterpretation to fool the unwary. Caveat emptor. The first problem is that to state an Availability you need just two numbers, the MTBF (mean time between failures, in hours) and the MTTR (mean time to repair, in hours): you express the Availability by dividing the MTBF by the total time (MTBF + MTTR) and multiplying by 100 percent.

So, having a very long MTBF and a very short MTTR gives you an incredibly high result. Unfortunately, both MTBF and MTTR are numbers that marketing departments can simply guess at, if they use them at all. For example, you can quote 99.999 percent for a UPS simply by assuming that the client has the skills and spare parts on site and can repair it himself in 20 minutes, instead of calling the service engineer, waiting for spare parts and then re-testing before putting it back into service (often a day or longer). Guessing an MTBF (say 100,000 hours, a little under 12 years) and then playing with the MTTR (anywhere between 20 minutes and 12 hours or more) can produce any result you want.
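To make the arithmetic concrete, here is a minimal sketch in Python (the figures are simply the ones quoted above, not vendor data) showing how one guessed MTBF produces very different Availability figures depending on the MTTR you choose to assume:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability (%) = MTBF / (MTBF + MTTR) * 100."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

MTBF = 100_000  # hours -- the guessed figure, a little under 12 years

for label, mttr in [("20 minutes (client self-repairs)", 20 / 60),
                    ("12 hours (engineer on site)", 12.0),
                    ("24 hours (call-out, spares, re-test)", 24.0)]:
    print(f"MTTR of {label}: {availability(MTBF, mttr):.5f}%")

# MTTR of 20 minutes (client self-repairs): 99.99967%
# MTTR of 12 hours (engineer on site): 99.98800%
# MTTR of 24 hours (call-out, spares, re-test): 99.97601%
```

Same UPS, same guess at reliability; only the repair assumption changes, and ‘five-nines’ appears or vanishes accordingly.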

The second problem is a combination of the number of failure events (summing multiple MTTRs) and the MTBF. The original Uptime white paper (now withdrawn) attempted to link the Availability percentage with the four Tiers but didn’t define the period over which it would be measured. This led to the strange scenario where a low-Tier facility would offer to be off-line for 53 minutes per year while the ultimate Tier IV would offer only 5.3 minutes. How bizarre was that? A failure once a year is a disaster, for any ‘Tier’.
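Those per-year allowances are just the ‘nines’ converted into minutes over a twelve-month window; a quick sketch of the conversion, for illustration only:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in (99.99, 99.999):
    downtime = (1 - nines / 100) * MINUTES_PER_YEAR
    print(f"{nines}% availability allows {downtime:.1f} minutes off-line per year")

# 99.99% availability allows 52.6 minutes off-line per year
# 99.999% availability allows 5.3 minutes off-line per year
```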

Heart skips a beat

Anyway, let’s not dwell on that but consider the combination problem, which particularly affects numerous very short-lived failures. The easiest way to illustrate it is to suggest that your heart is 99.9 percent ‘available’. That doesn’t sound ‘too’ bad until you consider that it represents around 36,000 missed heartbeats a year: if they are all missed in one session you are very dead, whilst if they are evenly spread over the year you are just feeling unwell. In data center terms, look at the voltage supplied to the load. Many modern servers cannot withstand a break in supply longer than 10ms (milliseconds), and some considerably less at 6ms, so offering 99.9999999 percent Availability in the power system (nine-nines) could still permit three 10ms failures every year.
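A back-of-the-envelope check shows where both numbers come from (roughly 70 beats per minute is assumed for the heart; the power figure follows directly from the nine-nines):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600        # 31,536,000

# A heart at ~70 beats per minute that is 99.9 percent 'available'
beats_per_year = 70 * 60 * 24 * 365       # ~36.8 million beats
missed_beats = beats_per_year * 0.001
print(f"Missed beats per year: {missed_beats:,.0f}")         # 36,792

# A nine-nines power system: permitted downtime per year, in milliseconds
downtime_ms = (1 - 0.999999999) * SECONDS_PER_YEAR * 1000
print(f"Permitted downtime: {downtime_ms:.1f} ms per year")  # 31.5 ms, i.e. three 10ms breaks
```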

So what to do? Well, there is nothing wrong with Availability as a metric as long as it is clear what it is based upon. For example, ‘an Availability of 99.99 percent measured over 10 years with a single failure lasting no longer than 10 hours’ is a clear statement of MTBF (10 years) and MTTR (10 hours). OK, the marketing boys and girls may have rounded the answer up from 99.98859… percent, but you may, by now, be getting the point that the MTBF is more important than the Availability and, to boot, you need the MTBF to calculate the Availability in the first place. The ‘single failure’ caveat avoids the summation of multiple events.
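Plugging those two numbers back into the same formula (using 8,760 hours per year) reproduces the un-rounded figure:

```python
HOURS_PER_YEAR = 8_760

MTBF = 10 * HOURS_PER_YEAR   # one failure in ten years = 87,600 hours
MTTR = 10                    # a single failure lasting no longer than 10 hours

print(f"{MTBF / (MTBF + MTTR) * 100:.5f}%")   # 99.98859% -- rounded up to 99.99%
```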

The next time someone offers you 99.999 percent of anything just ask them ‘over what period’ and watch their expression change – it can be fun.

Of course, the ultimate ‘failure’ of a resilient data center is the easiest to achieve. It is not hacking into the UPS over the internet and turning off the power, or (as in a recent movie) raising the server inlet temperature to cause a melt-down. No, just consider the definition of a data center: a facility housing compute, storage and I/O connectivity, right? So, walk round the outer perimeter of the property noting the location of the fibre pits, then return later that night with a few chums, each in a white van and armed with a balaclava, a few gallons of unleaded and a box of matches. Grenades would be better, but my local garage doesn’t sell them.

Whip up the cast-iron pit lids, dump the petrol and, like it says on the firework boxes, ‘light and quickly retire’. Within seconds you are fleeing the scene in multiple directions and the data center is disabled for several days. The same principle applies to those strange folks who want to build an earthquake-proof facility. If the earthquake hits your location it will almost certainly sever the fibre and, without connectivity, a data center is reduced to a secure depository for second-hand ICT kit and out-of-date data sets…

Ian Bitterlin is a consulting engineer at Critical Facilities Consulting Ltd and a visiting professor at the University of Leeds, School of Mechanical Engineering. He also coaches data center staff at DCPRO.