In Japan, on the very rare occasions that a train pulls in late, the driver and guard will alight from the train and bow low to apologize profusely to the waiting passengers for the extra minute they have had to wait. If they did that in the UK, rail workers would be retiring early in their thousands with bad backs.

Japanese railway staff rarely have to apologise over late trains – Armin Forster, Pixabay

But in the data center sector, across the world, clients expect services to be even more reliable than Japan’s notoriously punctual railways, and given the complexity of even the most modest of modern data centers, that is a major challenge indeed.

Terry Rodgers is a vice president at data center services firm Jones Lang LaSalle (JLL). An industry veteran with almost 40 years of experience in the mission-critical sector, Rodgers has been called in to firefight and troubleshoot data center projects, disasters and downtime around the world. In that time, he has seen pretty much everything that could possibly go wrong, and he can tell within minutes of walking into a facility whether it is well run or leaves much to be desired.

“Not long ago, I was called in regarding a chilled water pump that had suddenly exploded and disintegrated, and nobody could figure out why. It was an unusual situation.

“I did some forensics and, basically, we found that the pump was operating with both its suction and discharge manual isolation valves closed. The pump kept heating the water up until it got to maybe 200 psi, and 380 degrees Fahrenheit [around 200 degrees Celsius], maybe more.

“When it finally cut loose, all of that hot water flashed to steam, and there was an explosion that threw a 300-pound motor 30 feet against the wall, and blasted a piece of pipe through another wall; it was catastrophic. The phenomenon is called a boiling liquid expanding vapor explosion, or BLEVE.

“When we went back and looked at the root cause, we found that the piece of equipment had been taken out of service for repairs and had never been properly put back into service. So the manual isolation valves were left closed, and there was an anomaly with a flow meter that had called the pump to run when it wasn't set up to.

“Then there was a failure with the building management system, which was working, but it wasn't sending remote notifications, so nobody knew that the pump had been running or that there was anything wrong with it. Any one of these things would have caused a manageable problem, but because you had all of these little things occurring in a series, it ended up in catastrophic failure,” says Rodgers.

“I have done quite a few forensics, and I enjoy it even though it's unfortunate. But it's fascinating because you learn how, and more importantly, why things fail.”

While forensics might be fascinating, the cause of most data center downtime and other adverse events is typically more mundane.

“Often, the controls are the weakest link,” says Rodgers. “The reason for that is not that the controls aren’t robust and reliable, but that they’re black boxes.”

While the technician who sets them up and programs them will know the technology inside out, he says, the operations staff who oversee them will inevitably know a lot less. “To the facility’s staff who have to manage it, they’re little more than a box on the wall. They get their insights from the data on the data center’s management consoles.

“They can see the data, graphics and points, but they can’t see the logic and a lot of the settings and configurations that are behind all that information. And there can be latent, underlying issues waiting to get you,” says Rodgers. For example, the controls may perform day-to-day functions on an entirely automated basis, and work well until another element in the data center is changed or a firmware upgrade is installed. “But as soon as there's a reconfiguration or another piece of equipment fails, and it rolls over to the redundant piece, the latent failure occurs.”

Furthermore, he adds, ever more complexity has been added to these control systems over time, including the integral controls on individual pieces of equipment, partly to keep up with increasing data center complexity, but also due to competition between vendors.

“As you add more complexity, you're adding more opportunities for human failure to occur. I've seen that personally in several situations where there was an anomaly, there was a forensic investigation, and the root cause was that, well, the box did exactly what it was programmed to do, it was just set up incorrectly,” says Rodgers.

Going nuclear

Rodgers started his working career not in data centers – of which there were few at the time – but in the nuclear industry, where ‘something going wrong’ can lead to more than just a spot of downtime.

“I worked for General Electric starting in construction and refuelling outages, and then got the opportunity to get into nuclear training. I taught the senior reactor operator certification course around the control room, simulators, etc., for large nuclear power plants. What I learned in that process is the importance of quality control and quality assurance with regard to building and running critical facilities, as well as the operational processes behind operating a plant,” he says.

The aerospace industry, government and even mortgage company Fannie Mae followed, bringing Rodgers into contact with data center technology and letting him watch the industry learn the hard way as it developed. “In aerospace, we were doing 300 watts per square foot in the late 1990s, and the strategies and philosophies behind critical facilities with regard to redundancy had to be drilled all the way down into the controls to avoid single points of failure. Equally important was discipline and adherence to procedures,” he says.

When Rodgers arrived at Fannie Mae, its data center was running at around 15 watts per square foot. “They thought that was great, but I learned of all the worst processes you could possibly see. So I introduced better processes, and radically updated the infrastructure.”

It was, perhaps, at Fannie Mae that Rodgers also learned some of the most important lessons on data center management and, especially, how to ‘nose out’ a well-run data center from a poorly run facility.

“I've now done site assessments around the world for many different clients. And I've seen the best in practice, and some of the worst. And, in general, I can walk into a site and within the first hour or two I know that there's either a culture of excellence, or there isn't.

“It may sound superficial, but you walk in, see everything in its place and everybody knows the answers to questions off the top of their head. They can hand you procedures, they’ve done drills, they’ve done training, the lights are all working, there’s no damaged insulation.

“All these things are superficial, but then when you peel it back you find that there’s robustness to the processes, the training and the people – the facility has established a culture of excellence.

“But in other cases, you walk in and there’s insufficient staff, and you can see deferred maintenance and superficial problems. Documentation isn’t controlled and therefore isn’t trusted. Well, if you can’t get the superficial stuff right, as you peel it back and start looking underneath you find that the major maintenance isn’t getting done either, staff attrition is high and new staff aren’t getting the training they need. There’s a lack of pride in ownership. All these issues can ultimately lead to the types of failures that can cause major downtime,” says Rodgers.

Indeed, in data center management the old cliché about ‘people, process and technology’ might be better rendered as ‘people, process and maintenance’. And not just maintenance of infrastructure, but of people and processes as well. “That also leads to what I call a culture of prevention, where you invest today in the processes, the people and the maintenance programs, and whatever it takes to ensure that you're always on top of your game.”

“It starts with leadership establishing high standards, and then allocating the resources to allow staff to meet those standards. If you're not doing that, you're constantly trying to just get by, doing the best you can with the limited resources you have at any given time. Then, the little things start to line up and, eventually, you get that perfect storm where it's generally not one thing that fails: it's a little thing here that a guy wasn't trained on, so he took the wrong action, which took out some other piece of equipment because somebody else hadn’t put it back in service,” he says.

Of course, when something goes horribly wrong at a data center and a string of websites go down, the best an operator can do is publish a message of apology and an explanation on a website that is working, and hope the public understand, while working furiously to fix it.

But maybe a Japan railways-style apology would go down even better?