In a recent research report, Uptime Institute concluded that downtime in data centers is common and may even be increasing, despite many advances and much effort and investment. Unfortunately, management often attributes failures of complex systems like data centers to human error on the part of operators in the field, when the reverse is often true: management shortcomings are usually the root cause.

In their 2018 book Meltdown, Chris Clearfield and András Tilcsik break new ground in exploring the causes of management failures that lead to apparent human error, including solutions at the C-suite or board level that many engineers and executives alike may find counterintuitive. This book should be essential reading for the CEO, CIO, and HR departments at any organization operating significant IT, as it points the way to solving numerous risk issues and avoiding potential incidents that can have significant costs and consequences.

Outsiders know best


The authors base their ideas on the social sciences rather than on business or engineering curricula, including the scholarship of Charles “Chick” Perrow, a sociology professor emeritus at Yale University. As part of a presidential committee investigating the nuclear disaster at Three Mile Island, Perrow recognized, as does Uptime Institute, that the accident, and others like it, could not be blamed on any individual; rather, it was a management or organizational problem.

Clearfield and Tilcsik argue for including outsiders in risk assessment: individuals who are not constrained by prior participation in a project and who can evaluate it independently, without regard for team loyalty, personal attachments, internal pressures, or even budget constraints. These are the factors that lead to compromises that cause projects to fail.

Some data center operators might recognize Uptime Institute’s Tier Certification and M&O Stamp of Approval as meeting the criteria for outside, independent validation of project plans and construction. Our FORCSS decision-making methodology, too, helps companies look past their organizational limitations, with Uptime Institute consultants acting as “outsiders” who help organizations see the forest for the trees.

Racial and gender diversity is just one path to eliminating ‘sameness’ within leadership teams. Including industry outsiders with different academic and professional backgrounds and perspectives also increases productive conflict, which is needed to unearth genuine problems.

In addition, reducing an overconcentration of experts by incorporating more diverse perspectives makes executive management more skeptical of its own abilities and less willing to take undue risks. People, it seems, no matter their station in life, tend to defer to experts and trust those they most identify with. Yet people of different backgrounds and experience levels are more likely to challenge each other’s assumptions, so issues are aired more thoroughly.

The authors’ positions contrast with our industry’s status quo, according to data from Uptime Institute’s 2018 global data center survey of operators: 70% of respondents said that the lack of women in the sector’s workforce is not a threat to their businesses or the industry at large, even as industry leaders say they struggle to find new hires.

According to Uptime Institute research, the data center industry is increasingly adopting technical solutions to address, at least partially, downtime concerns: DCIM software and hybrid IT approaches are two prominent examples. However, Perrow noted that automation and computerization can also obscure what’s really going on and lead to missteps during an emergency. In highly automated facilities, operators can no longer visually confirm that an operation has taken place; instead, they must rely on indicators that may be misleading.

Today we know that the causes of the Three Mile Island accident were trivial: a combination of small failures, including a plumbing problem, a stuck valve, and an ambiguous indicator light. Together these caused the system to run amok within 13 seconds and damage the nuclear core in less than 10 minutes, yet it was impossible for the people on site to see what was really going on. Similar sequences of events can be found in many data center incidents, which are then attributed to human error.

As a result, operators of mission-critical IT must perform as though they have three complex, tightly coupled systems to manage: the data center, brand reputation, and the product or service offered by the company. Consider, for example, the effect of data center failure on airlines. After a data center fails, flights are grounded and the airline’s brand (and bottom line) suffers as a result.

Soon after, the airline announces that a fire, an accident during testing, or some technical oversight by the operator caused the initial incident. Few ask: why should these events, by themselves, bring IT operations to a halt? Were procedures followed? Were IT budgets adequate? Was the risk properly evaluated during the data center’s design, construction, and commissioning phases? Why were IT and executive leadership so certain that spending was properly directed?

Clearfield and Tilcsik write, “Diversity is helpful not so much because of a unique perspective that minorities or amateurs bring to the table but because diversity makes the whole group more skeptical.”

They draw evidence to support their conclusions from many industries. They note that a board consisting solely of older men, including Henry Kissinger, Bill Perry, George Shultz, James Mattis, and Riley Bechtel, failed to detect the rampant fraud at healthcare tech firm Theranos. Today that company is out of business because of the scandal that ensued.

The airline industry, they note, introduced a concept called crew resource management to address the power disparity between the captain and the first officer, after airline data showed that more incidents took place when the captain was at the controls. Crew resource management addressed the hesitancy of less experienced officers to challenge the captain, a reluctance that undermined the effectiveness of flight-deck cross-checks when the captain had command. The more balanced protocol made it necessary for both officers to think through a problem before deciding on a course of action.

Greater diversity is just one part of the solution to a hard-to-see problem. Without sufficient numbers, the outsider (be it the under-represented minority, the non-manager, the non-engineer, the non-banker, or the newcomer) may be overlooked or disregarded. Leadership must be expanded to include these perspectives.

The entire diversity argument does not rely simply on numbers. Leadership qualities matter as well. Employees, and even executive leaders, respond negatively when they work in a fearful environment. Even the anonymous suggestion box can communicate that management is not really open to new ideas. Similarly, the leader who opens meetings by offering his preferred decision is less likely to generate the productive conflict that might lead to better ideas.

IT veterans and upper management may chafe at the suggestions in Meltdown, but the authors’ assessment of traditional diversity efforts might cause some human resources officers to frown as well. While continuing to advocate for increased diversity and productive conflict, Clearfield and Tilcsik cite studies showing that formal mentoring programs, tracking (but not mandating) diversity hiring, and role rotation are the path to greater skepticism—and, therefore, better project management and organizational decision making.

The approaches outlined by the authors would, we believe, help the data center industry recognize behaviors identified by Uptime Institute’s 2018 Data Center Survey as corporate failings, and would go a long way toward enabling the industry to resolve its most important issues.