The National Renewable Energy Laboratory (NREL) has provided an update on its ongoing exploration of using artificial intelligence for data center operations.
In partnership with HPE, the lab has been rolling out AI Ops across its Energy Systems Integration Facility, the world's most efficient data center.
Preparing for exascale data center management
When we talked to HPE about the effort last year, the project was still in its early days.
“We formed a team to do a very deep analysis and design of what is needed to build an exascale system that is really usable and operational in a real world environment,” Mike Vildibill, at the time HPE’s VP & GM of high performance networking, said.
“We need to manage and monitor this thing, we have to collect this much data from each server, every storage device, every memory device, and everything else in the data center. We've got to put it in a database. We've got to analyze it, and then we’ve got to use that to manage, monitor, and control the system.”
They found that the management of an exascale system would itself need a ~200 petaflops supercomputer to run. “Okay, so we’ve stumbled across a real problem," Vildibill said.
“We realized we needed AI Ops on steroids to really manage and control - in an automated manner - a big exascale system."
The company partnered with NREL on a three-year study to test AI Ops on the lab's comparatively smaller supercomputers, with the 8 petaflops Eagle serving as NREL's flagship deployment.
Although designed for 10MW, the data center currently has a system capacity of 5MW and typically consumes 2MW.
To get closer to the experience of a 1,000+ petaflops machine, the team expanded the data the site produced. “Like for example, if one sensor gives one data point every second, we want to go in and tweak it and have it do a 100 per second," Vildibill said.
"Not that we need 100 per second, but we're trying to test the scalability of all the infrastructure in planning for a future exascale system."
NREL’s sensors measure not only power consumption from IT equipment, but also metrics about network use, storage, and various system components (such as temperature, pressure, flow rate, valve states, and fan speeds), as well as external environmental conditions. Altogether, the system records one million metrics per minute.
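To put the one-million-metrics-per-minute figure in perspective, the short sketch below converts it into per-second and per-day rates. The `scale` parameter is a hypothetical multiplier added for illustration: the article only describes raising an individual sensor from one to 100 readings per second as an example, so applying 100x to the whole stream is an assumption, not a reported configuration.

```python
# Back-of-the-envelope ingest rates based on the figure quoted in the article.
METRICS_PER_MINUTE = 1_000_000  # reported by NREL


def ingest_rates(metrics_per_minute: int, scale: int = 1) -> dict:
    """Return per-second, per-hour and per-day metric counts.

    `scale` is a hypothetical stress-test multiplier (an assumption,
    not a number taken from the NREL report).
    """
    per_minute = metrics_per_minute * scale
    return {
        "per_second": per_minute / 60,
        "per_hour": per_minute * 60,
        "per_day": per_minute * 60 * 24,
    }


baseline = ingest_rates(METRICS_PER_MINUTE)
stress = ingest_rates(METRICS_PER_MINUTE, scale=100)

print(f"baseline: {baseline['per_second']:,.0f}/s, {baseline['per_day']:,} per day")
print(f"100x:     {stress['per_second']:,.0f}/s, {stress['per_day']:,} per day")
```

At the reported rate, that works out to roughly 16,700 metrics per second, or about 1.44 billion per day.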
Now, in a report funded by both the Department of Energy and HPE, NREL has detailed how the first year of the study has gone.
"Such a large volume and velocity of data requires a system that can effectively handle millions of simultaneous data streams but is also resilient to downtime and lags in reporting," the paper states.
"The data architecture design for the collection of data in the ESIF data center, therefore, considers the data sources, data frequencies, the movement of data, and the eventual storage and use of the data. The goal of the data collection architecture is to provide a scalable infrastructure suitable for collecting, managing and processing streaming data from multiple heterogeneous data sources."
In June, the ESIF data center began using this data for anomaly detection. "In support of operational resiliency, the streaming data and analytics platform was initially deployed with a pipeline for detecting anomalies in the cooling infrastructure using historical and real-time data from the Eagle supercomputer and the ESIF data center," the paper notes.
The sheer volume of data makes it hard for dashboards to capture everything happening within a facility in a way that humans can digest. "This stems from the large number of simultaneous data streams that require monitoring as well as the compounding impacts of a large number of potential adjustments that can be made to nearly every device in the facility cooling system to achieve optimal system performance," the paper says.
"The [Advanced Computing Operations] team has also found that set-points, alarms, and dashboards are not always sufficient to identify anomalies in the system."
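To illustrate why fixed set-points can miss problems, here is a minimal sketch of one common streaming technique, a rolling z-score detector. It is not the method used by NREL or HPE, which the article does not describe; the function name, window size, threshold, alarm set-point, and synthetic temperature readings are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=30, threshold=3.0):
    """Yield (index, value, z) for readings far from the recent window's mean.

    A reading is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the previous `window` readings.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) > threshold:
                    yield i, value, z
        history.append(value)


# Hypothetical coolant temperatures (degrees C): steady around 20.1,
# followed by a sudden jump that still sits below the fixed alarm level.
ALARM_SETPOINT = 25.0  # assumed static threshold, for comparison only
readings = [20.0 + 0.2 * (i % 2) for i in range(60)] + [22.0]

hits = list(rolling_zscore_anomalies(readings))
for index, value, z in hits:
    print(f"reading {index}: {value:.1f} C (z = {z:.1f}), "
          f"set-point alarm fired: {value > ALARM_SETPOINT}")
```

The jump to 22.0 never crosses the assumed 25.0 alarm level, so a static set-point stays silent, while the detector flags it because it is far outside the sensor's recent behavior.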
Previous outages and issues at the site were used to help train the AI system, and highlighted the problem with relying on people to spot faults.
A problem with Cooling Distribution Units "went unnoticed for many months," and data on what went wrong was later used to train the AI Ops system.
"In 2015, a 3-way valve failure that led to system shut down did not appear to be a high priority item to monitor but caused NREL to lose 20,000 node-hours in the shutdown," the researchers said.
"Motivated by this work, a key ongoing priority is automation around the monitoring and selection of sensors. This is a fundamental paradigm shift in how dashboards are built and used, allowing data center operators to monitor everything and focus on key anomalous events."
Along with anomaly detection, the streaming data architecture "has also allowed NREL researchers to investigate the power footprint of individual jobs on Eagle and their associated cooling resource requirements.
"Current research being undertaken at NREL in collaboration with HPE as part of the AI Ops project seeks to extend the use case of power usage prediction and to build a prototype implementation."
To help others carry out similar work, NREL has released a dataset of three months of job data and derived node-level power metrics per job.
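As a rough illustration of how a per-job power footprint can be derived from node-level readings, the sketch below integrates power samples over time with the trapezoidal rule and sums the result across a job's nodes. The data layout (a dictionary of node names to `(timestamp_seconds, watts)` samples), the function names, and the numbers are hypothetical; they do not reflect the schema of NREL's released dataset.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def node_energy_kwh(samples):
    """Integrate (timestamp_seconds, watts) samples with the trapezoidal rule."""
    ordered = sorted(samples)
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / JOULES_PER_KWH


def job_energy_kwh(node_samples):
    """Sum the energy of every node that took part in a job."""
    return sum(node_energy_kwh(samples) for samples in node_samples.values())


# Hypothetical one-hour job on two nodes, each drawing a constant 400 W.
job = {
    "node-a": [(0, 400.0), (1800, 400.0), (3600, 400.0)],
    "node-b": [(0, 400.0), (1800, 400.0), (3600, 400.0)],
}

print(f"job energy: {job_energy_kwh(job):.2f} kWh")
```

Two nodes at 400 W for one hour come to 0.8 kWh, which matches a hand calculation and serves as a quick sanity check on the integration.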
So far, the AI Ops system has not had a profound impact on power usage effectiveness, with ESIF reporting a PUE of 1.06 - in line with what it usually reports, but short of its best of 1.032, recorded in 2017.
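For readers unfamiliar with the metric, PUE is the ratio of total facility energy to the energy used by IT equipment, so lower is better and 1.0 would mean no overhead at all. The sketch below shows the calculation and what the two figures above imply; the kWh inputs are made-up round numbers chosen to reproduce the reported ratios, not NREL measurements.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_fraction(pue_value: float) -> float:
    """Share of extra energy (cooling, power delivery, etc.) per unit of IT energy."""
    return pue_value - 1.0


# Hypothetical energy totals that reproduce the PUE values quoted in the article.
current = pue(total_facility_kwh=1060.0, it_equipment_kwh=1000.0)
best = pue(total_facility_kwh=1032.0, it_equipment_kwh=1000.0)

print(f"current PUE {current:.3f}: {overhead_fraction(current):.1%} overhead")
print(f"best PUE    {best:.3f}: {overhead_fraction(best):.1%} overhead")
```

In other words, a PUE of 1.06 means about 6 percent of additional energy is spent on top of the IT load, compared with 3.2 percent at the facility's 2017 best.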
This year, the NREL AI Ops software is expected to begin doing predictive maintenance and more hands-on PUE optimization. Future updates in 2021/2022 will add root-cause analysis.
With the data gathered by the program, NREL also plans to develop a model for forecasting PUE weeks or months ahead.
"Taken together, these efforts will inform future supercomputer procurement efforts as to the type of resources that are used, how efficiently they are used, and ways that the NREL HPC community can improve its practices and help steer advancement in the design and widespread adoption of energy-efficient data center practices, significantly decreasing the carbon cost of leadership-class computing while also reducing maintenance costs and improving system reliability."