Fully connected and always-on. This is the digital lifestyle that most people now seem to follow in both personal and work domains. Given that we never want a moment offline, the infrastructure that underpins this reliance becomes all the more important. It may not be front of mind for many consumers, who expect their digital products and services to just ‘work’, but you can be sure they’ll be the first to complain when something does fall over.
Take the recent airline data center outages at Delta, Southwest and British Airways as prime examples. In each case, a simple electrical failure or an incorrect maintenance procedure led to hundreds of millions in losses, catastrophic server damage, and tens of thousands of passengers stranded at airports around the globe.
These large-scale outages will always dominate the press, but incidents of downtime are more common than you might think. According to the Uptime Institute’s seventh annual Data Center Industry Survey, 25 percent of organizations surveyed experienced a data center outage in the last 12 months, either on their own premises or at a service provider’s site. And 90 percent of data center and IT pros say their corporate management is more concerned about outages now than it was just 12 months ago.
Not every outage is as damaging or as public as the BA incident; however, there is clearly still some confusion about the financial implications of downtime. Again according to the Uptime Institute survey, only 60 percent of organizations actually measure the cost of downtime as a business metric. In 2017, this is something all organizations, large and small, should be doing. Having a financial figure in mind for each minute or hour of downtime can go a long way towards keeping adequate infrastructure resilience front of mind, both for IT professionals and for those assigning budgets to its upgrade and upkeep.
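Putting a figure on each minute of downtime can start as simple arithmetic. The sketch below is a hypothetical model, not a standard costing method: the class name, its three cost components and every number in the example are illustrative assumptions, and a real figure would also need to account for reputational damage, contractual penalties and similar factors.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCostModel:
    """Hypothetical downtime cost model; all inputs are illustrative assumptions."""
    lost_revenue_per_hour: float  # revenue that cannot be earned while systems are down
    staff_cost_per_hour: float    # idle staff plus incident-response effort
    recovery_cost_fixed: float    # one-off cost per incident (repairs, call-outs, etc.)

    def cost_per_minute(self) -> float:
        # Convert the recurring hourly costs into a per-minute rate.
        return (self.lost_revenue_per_hour + self.staff_cost_per_hour) / 60

    def incident_cost(self, minutes_down: float) -> float:
        # Fixed recovery cost plus the per-minute rate for the outage duration.
        return self.recovery_cost_fixed + self.cost_per_minute() * minutes_down


# Made-up figures, purely for illustration.
model = DowntimeCostModel(
    lost_revenue_per_hour=120_000,
    staff_cost_per_hour=6_000,
    recovery_cost_fixed=50_000,
)
print(f"{model.cost_per_minute():,.0f} per minute")              # 2,100 per minute
print(f"{model.incident_cost(90):,.0f} for a 90-minute outage")  # 239,000 for a 90-minute outage
```

Even a rough per-minute rate like this gives budget holders a concrete number to weigh against the cost of resilience upgrades.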
Of course, being aware of data center risk and actually taking proactive steps to predict potential resilience issues accurately, before something goes wrong, are two very different things. So how can organizations safeguard themselves against downtime, and how can they limit the damage when it does occur?
The definitions of efficiency
Every data center manager wants their site to be efficient. From an operational standpoint, a data center is efficient when the power and cooling supplied to the facility allow it to meet IT demand without incurring needless costs. From a commercial standpoint, a data center must be able to maintain this balance whilst remaining flexible to the needs of the business. This means infrastructure, compute power and performance need to scale effectively and frequently, with no risk of downtime.
However, in the majority of data centers today, the performance impact of changes to the DC environment, such as new technology roll-outs, is not factored in. IT teams know that their deployments will use a well-defined amount of space, network and power, but beyond that they often have little understanding of (or concern for) the impact their changes will have on the data center environment. That is the responsibility of the facility manager, who must react immediately if IT provisioning has had any negative effect on the data center’s ability to safely house all of the IT equipment.
The problem is that these two teams currently operate independently. Many organizations have deployed DCIM (data center infrastructure management) technology with the goal of bridging the data and process gaps that exist within a business around its data centers. This is a positive step, but it doesn’t cover all bases.
Simulation against every eventuality
Imagine being able to predict accurately how any change, from installing a single blanking plate on a rack to increasing your facility’s power by 300 kW, will affect your data center’s resilience.
This is already a reality in the form of engineering simulation, which allows facility managers to experiment in a safe offline environment: creating virtual prototypes, troubleshooting existing designs and analyzing what-if scenarios for future data center configurations.
This means that when business demands come flooding in, causing large swings in the workloads being handled, the data center can continue to perform resiliently. Alternatively, those demands can be held back until the necessary infrastructure changes have been made. Change need no longer be the enemy, and downtime can be eliminated or reduced to harmless levels.
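Real engineering simulation models airflow, temperatures and equipment behavior in far more detail than any short script can. Still, the basic what-if question (does a proposed change fit, or must it wait for infrastructure work?) can be illustrated with a simple headroom check. Everything below is a hypothetical sketch: the `Facility` fields, the `what_if` function, the 20 percent safety margin and the assumption that cooling load equals IT load are illustrative choices, not an industry method.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    power_capacity_kw: float    # usable power available to IT (hypothetical figure)
    cooling_capacity_kw: float  # heat the cooling plant can reject (hypothetical figure)
    it_load_kw: float           # current IT draw


def what_if(facility: Facility, added_it_kw: float, safety_margin: float = 0.2) -> list[str]:
    """Return the constraints a proposed change would violate (empty list = it fits).

    Simplifying assumption: nearly all IT power ends up as heat, so the
    cooling load is taken to equal the IT load.
    """
    new_load = facility.it_load_kw + added_it_kw
    problems = []
    if new_load > facility.power_capacity_kw * (1 - safety_margin):
        problems.append("power")
    if new_load > facility.cooling_capacity_kw * (1 - safety_margin):
        problems.append("cooling")
    return problems


site = Facility(power_capacity_kw=1_200, cooling_capacity_kw=900, it_load_kw=500)
print(what_if(site, added_it_kw=300))  # ['cooling']: hold the change until cooling is upgraded
print(what_if(site, added_it_kw=150))  # []: fits within both limits
```

In this made-up example the power supply could absorb a 300 kW increase but the cooling plant could not, which is exactly the kind of finding that lets demand be curtailed until the infrastructure catches up.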
From a continuity standpoint, other simulations can also be run. For example, if a power failure occurs and battery backup takes over, will any critical systems go offline? If an engineer does not follow the correct protocol when rebooting a power system, will it have an adverse effect on the data center? If so, how can this be mitigated without causing further damage? All these questions and more can be answered through simulation, helping data center managers create strategies in which critical hardware is positioned so that it is the last to fail.
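The battery-backup question can be illustrated with a toy ride-through plan: given a battery's stored energy and how long it must carry the load, which systems stay up if the least critical loads are shed first? This is a hypothetical sketch, not a simulation tool. The `Load` class, the `plan_ride_through` function, the priority scheme and all figures are assumptions, and the model ignores inverter losses, battery ageing and discharge-rate effects.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    kw: float
    priority: int  # 1 = most critical; higher numbers are shed first


def plan_ride_through(loads: list[Load], battery_kwh: float,
                      minutes_needed: float) -> tuple[list[str], list[str]]:
    """Decide which loads stay on battery until backup generation takes over.

    Toy model: the battery is treated as a fixed energy store. Loads are
    admitted strictly in priority order; once one cannot be carried, it and
    every less critical load are shed, so critical systems are last to fail.
    """
    hours = minutes_needed / 60
    kept: list[str] = []
    shed: list[str] = []
    total_kw = 0.0
    cut_off = False
    # sorted() is stable, so loads with equal priority keep their input order.
    for load in sorted(loads, key=lambda ld: ld.priority):
        if not cut_off and (total_kw + load.kw) * hours <= battery_kwh:
            kept.append(load.name)
            total_kw += load.kw
        else:
            cut_off = True
            shed.append(load.name)
    return kept, shed


# Hypothetical site: 50 kWh of battery must bridge 10 minutes until generators start.
loads = [
    Load("dev-test", 60, priority=3),
    Load("core-network", 40, priority=1),
    Load("compute", 150, priority=2),
    Load("storage", 80, priority=1),
]
kept, shed = plan_ride_through(loads, battery_kwh=50, minutes_needed=10)
print("kept:", kept)  # kept: ['core-network', 'storage', 'compute']
print("shed:", shed)  # shed: ['dev-test']
```

Admitting loads strictly by priority, rather than packing in whatever happens to fit, mirrors the goal of ensuring critical hardware is the last to fail.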
Given that 90 percent of data center and IT professionals say their corporate management is more concerned about outages now than it was just 12 months ago, operational resilience should be a top priority for both IT and Facilities teams. And with the strategy and tools described above, it can be.
And what of management at the other 10 percent, who aren’t more concerned? Either they know their teams have safeguarded against every eventuality, or they will soon change their tune when downtime next hits their organization. Then you can bet they too will understand its massive impact on their company’s reputation and bottom line.