Already 2020 is looking like it’s going to be one of the top five hottest years on record, with temperatures easily matching last year’s summer heatwave highs.
While that’s great for people wanting to share time outdoors with family and friends after a tough few months, the fact that the UK was hotter than the Bahamas last month should be ringing alarm bells for data centers. Peak temperatures inevitably bring thermal challenges, and as the UK’s ten hottest years on record have all occurred since 2002, data center cooling strategies clearly need to be ready for whatever summer brings.
Given that cooling issues still account for almost a third of unplanned data center outages, mitigating the impact of escalating temperatures needs to be part of any data center's risk planning approach. Unfortunately, most organizations still seem unaware of the thermal risks that can quickly place their data center operations in danger. Or perhaps they think they’re on top of the issue because they already have a solid BMS in place? With thermal issues now ranking as the second largest cause of data center loss of service, it’s critical for organizations to reduce this risk by optimizing their thermal performance.
Identifying the early warning signs
Even the most experienced data center operations teams can be surprised at just how little time it takes for thermal issues to flare up. Cooling plant failure can easily escalate into a thermal runaway situation in no time, transforming a room that’s operating normally into a site that’s got real problems.
At EkkoSense we’ve found that one of the key reasons for this is that incumbent solutions such as a BMS are not very effective at spotting thermal challenges in time. Thermal issues such as cooling, and airflow problems typically don’t trigger BMS alerts early enough as there is no hard SLA breach or fault. And when they do, it’s often too late to prevent SLA breaches. The result is that thermal issues can quickly escalate, creating localized data center hot spots that impact overall performance before the operations team has a chance to resolve the problem.
Don’t wait for alerts – taking a more proactive approach
That’s why pre-empting potential thermal failure is so important. By applying the latest AI and machine learning technologies, affordable software solutions now exist that can work in parallel with BMS systems to identify and manage away thermal risk from any data center.
With this kind of real-time thermal monitoring in place you can track cooling outputs and identify any poorly performing cooling systems in advance so timely improvements can be made. Granular rack and CRAC level monitoring is essential here for finding the hidden – but easy to fix – cooling and airflow problems that typical cooling PPMs and BMS systems fail to find or diagnose.
Thanks to ongoing development of our EkkoSoft Critical monitoring system, we’re now able to complete remote thermal risk prediction analysis of critical sites. In one recent example, our software and analytics capability was used to remotely identify abnormal thermal behavior, remotely diagnose the issue and recommend how its effect could be immediately mitigated. All this before any BMS system has even picked up the issue.
This video provides an illustration of how a predictive analytics-based approach can equip data centers with the early warning capability that’s needed to prevent outages. In this example, a data center with a normal and steady cooling duty profile very quickly became thermally unstable thanks to a failing CRAC unit. The timeline was as follows:
- EkkoSense software analytics initially identified abnormal CRAC behavior, drawing on performance data from its EkkoAir cooling duty sensor in the CRACs.
- Software identifies the individual ‘poor performing’ CRAC
- Software provided early warning of a localized hot spot due to the CRAC issue
- Software analytics also showed that other local CRACs, while still functional, were unable to remove the hot spot
- Software recommended switching off the failed CRAC to remove recirculating hot air. Once actioned, the hot spot issue was immediately resolved
- CRAC issue investigated and resolved, normal thermal operation was restored and confirmed by the software
At no point during this sequence did the incumbent BMS generate an alert as there was no specific component failure or alert threshold that was breached. This example shows how the software’s early risk detection analytics was able to identify and diagnose poor-performing cooling plant before its ultimate failure, so that any potential thermal risk was removed, and repairs could be planned in good time. It also illustrates how the lack of an alert generation by the BMS means that without additional predictive analytics the data center team would have remained unaware of the issue or where to look. By looking at the data center room as a whole, the EkkoSoft Critical analysis software was able to pick up on subtle changes – such as a set point change, a stuck valve or a grille that’s been moved – that could potentially lead to a broader thermal issue.
Early warning before boiling over
Unlike traditional BMS approaches that only generate alerts when a system has failed or a threshold breached, EkkoSense’s blend of highly granular sensing and EkkoSoft Critical real-time algorithms can highlight potential equipment failures before they have a chance to impact data center service availability.
It’s only by removing 100 percent of thermal risk from their data center operations and providing a stable platform for subsequent cooling optimization projects, that data center managers can truly achieve ‘thermal piece of mind’. And, thanks to EkkoSoft Critical’s real-time insights – as opposed to the theoretical modeling provided by other approaches – it’s possible for data center teams to unlock other initiatives such as a transition to more condition-based monitoring. This could deliver significant savings by moving away from more complex, ineffective and expensive maintenance schedules.