The world’s most efficient data center wants to get even better, using artificial intelligence to eke out more compute power from the same electrical energy.
Building upon a wealth of data, the team behind the Energy Systems Integration Facility (ESIF) HPC data center hopes that AI can make its supercomputers smarter - and prepare us for an exascale future.
In search of perfection
Nestled amongst the research labs of the National Renewable Energy Laboratory campus in Colorado, the ESIF had an average power usage effectiveness (PUE) of just 1.032 in 2017, and currently captures 97 percent of the waste heat from its supercomputers to warm nearby office and lab space.
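A PUE of 1.032 is remarkably close to the theoretical ideal of 1.0, where every watt drawn goes to computing. The arithmetic behind the metric is simple; the sketch below is illustrative, with the only real figure being ESIF's reported 1.032 average:

```python
# Illustrative sketch of power usage effectiveness (PUE).
# The kWh figures are hypothetical; only the 1.032 ratio is ESIF's.

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE: total facility energy divided by IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# At a PUE of 1.032, a facility drawing 1,032 kWh overall delivers
# 1,000 kWh to its IT gear - just 3.2 percent overhead on cooling,
# power distribution, lighting, and everything else.
print(round(pue(1032.0, 1000.0), 3))  # 1.032
```

A typical enterprise data center runs closer to 1.5, so ESIF's overhead is roughly fifteen times smaller than the norm.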
For the last five to ten years, researchers at NREL have used sensors to track everything happening in the facility, and within its two systems - the HPE machines Peregrine and Eagle. This hoard of data has grown to more than 16 terabytes, just waiting for someone to use it.
A little under three years ago, Mike Vildibill - then VP of HPE’s Advanced Technologies Group - had a problem. He was in charge of running his company’s exascale computing efforts, funded by the Department of Energy.
“We formed a team to do a very deep analysis and design of what is needed to build an exascale system that is really usable and operational in a real world environment,” Vildibill, now HPE’s VP & GM of high performance networking, told DCD. “And it was kind of a humbling experience. How do we manage, monitor and control one of these massive behemoth systems?”
Vildibill’s team started with a brute force approach, he recalled: “We need to manage and monitor this thing, we have to collect this much data from each server, every storage device, every memory device, and everything else in the data center. We've got to put it in a database. We've got to analyze it, and then we’ve got to use that to manage, monitor and control the system.”
With this approach in mind, the group did a rough calculation for an exascale system. “They came back and told me that they can do it, but that the management system that has to go next to the exascale system would have to be the size of the largest computer in the world [the 200 petaflops Summit system],” he said: “Okay, so we’ve stumbled across a real problem.”
At the time, Vildibill was also looking into AI Ops, the industry buzzword for the application of artificial intelligence to IT operations. “We realized we needed AI Ops on steroids to really manage and control - in an automated manner - a big exascale system,” he said.
To train that AI, his team needed data - lots and lots of data. Enter NREL. “They have all this data, not just for the IT equipment, but for what we call the OT equipment, the operational technologies, the control systems that run cooling systems, fans, and towers, as well as the environmental data.
“We realized that that's what we want to use to train our AI.”
Armed with a data set of a whopping 150 billion sample points, Vildibill’s team last year announced a three-year initiative with NREL to train and run an AI Ops system at ESIF.
“Our research collaboration will span the areas of data management, data analytics, and AI/ML optimization for both manual and autonomous intervention in data center operations,” Kristin Munch, manager for the data, analysis and visualization group at NREL, said.
“We’re excited to join HPE in this multi-year, multi-staged effort - and we hope to eventually build capabilities for an advanced smart facility after demonstrating these techniques in our existing data center.”
Vildibill told DCD that the project is already well underway. “We spent several months ingesting that data, training our models, refining our models, and using their [8 petaflops] Eagle supercomputer to do that, although in small fractions - we didn't take the whole supercomputer for a month, but rather, we would use it for 10 minutes, 20 minutes here and there.
“So we now have a trained AI.”
Train, train, and train again
The system has now progressed to a stage, Vildibill revealed, that it can “do real time capturing of the data, put it into a framework for analytics and storage, and do the prediction in real time because now we have it all together.
“We did 150 billion historical data points. Now we're in a real time model. That’s the Nirvana of all of this: Real time monitoring, management and control.”
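Anomaly detection over a live telemetry stream is one of the core tasks such a system performs. The trained models NREL and HPE use are far more sophisticated, but a rolling z-score conveys the basic idea; the window size, threshold, and temperature readings below are all illustrative assumptions:

```python
# Minimal sketch of real-time anomaly detection on a telemetry stream.
# A rolling z-score stands in for the trained models; the window,
# threshold, and readings are illustrative, not NREL's.
from collections import deque
import statistics

class RollingAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if value deviates sharply from recent history."""
        anomalous = False
        if len(self.history) >= 10:  # wait for a minimal baseline
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
# Steady coolant temperatures, then a sudden spike
readings = [21.0, 21.2, 20.9, 21.1, 21.0, 20.8, 21.3, 21.1, 20.9, 21.0, 45.0]
flags = [detector.observe(r) for r in readings]
print(flags[-1])  # True - the 45.0 spike stands out from the baseline
```

In a production system the flag would feed the management layer, triggering either a human alert or an automated intervention.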
But, for all its value, data from Eagle and the outgoing 2.24 petaflops Peregrine can only get you so far. Exascale systems, capable of at least 1,000 petaflops, will produce an order of magnitude more data.
“The next step we're doing within NREL is just to bloat or expand the data that they're producing,” Vildibill said. “Like for example, if one sensor gives one data point every second, we want to go in and tweak it and have it do 100 per second. Not that we need 100 per second, but we're trying to test the scalability of all the infrastructure in planning for a future exascale system.”
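A back-of-the-envelope calculation shows why that stress test matters: telemetry volume scales linearly with sampling rate, so a 100x rate increase means 100x the ingest, storage, and analytics load. The sensor count and sample size below are assumptions, not NREL figures:

```python
# Back-of-the-envelope sketch of telemetry volume vs sampling rate.
# SENSOR_COUNT and BYTES_PER_SAMPLE are hypothetical, not NREL's numbers.

SENSOR_COUNT = 100_000     # assumed sensors across IT and OT equipment
BYTES_PER_SAMPLE = 64      # assumed size of one timestamped reading

def daily_bytes(samples_per_second: float) -> float:
    """Raw telemetry produced per day at a given per-sensor rate."""
    return SENSOR_COUNT * BYTES_PER_SAMPLE * samples_per_second * 86_400

baseline = daily_bytes(1)    # one sample per second per sensor: ~553 GB/day
stressed = daily_bytes(100)  # the 100x rate used to test scalability
print(stressed / baseline)   # 100.0 - volume grows linearly with rate
```

At the assumed figures, the baseline is already about half a terabyte a day, which helps explain why a naive monitoring design for exascale ballooned to the size of Summit.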
All this data is ingested and put into dashboards that humans can (hopefully) understand. "I could literally tell you 100 things we want on that dashboard, but one of them is PUE and the efficiency of the data center as a result."
As an efficiency metric for data centers, PUE has some detractors, but it’s good enough for NREL. “That's what NREL cares about, but we're building this infrastructure for customers who have requirements that we don't even yet know,” Vildibill said.
He noted that the system "might do prediction analysis or anomaly detection," and we “can have dashboards that are about trying to save water. Some geographies like Australia worry as much about how much water is consumed by cooling a data center as they do about how much electricity is consumed. That customer would want a dashboard that shows how efficiently they are using their data center by the metric of gallons per minute being evaporated into the air.
“Some customers, in metropolitan areas like New York, are really sensitive to how much electricity they used during peak time versus off hours because they've got to shape their workload to try to minimize electrical usage during peak times. Every customer has a different dashboard.
"That was the exciting thing about this program.”
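The per-customer dashboard idea can be pictured as each site registering only the metrics it cares about - power efficiency for one operator, water consumption for another. The sketch below is a minimal illustration; the metric names, structure, and numbers are assumptions, not details of the HPE/NREL system:

```python
# Illustrative sketch of per-customer dashboards, each built from the
# metrics that site cares about. All names and figures are hypothetical.

def pue(total_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total energy / IT energy."""
    return total_kwh / it_kwh

def wue(liters_evaporated: float, it_kwh: float) -> float:
    """Water usage effectiveness: liters of water per kWh of IT energy."""
    return liters_evaporated / it_kwh

# Each customer registers a view over the shared telemetry
dashboards = {
    "efficiency_focused": lambda m: {"pue": pue(m["total_kwh"], m["it_kwh"])},
    "water_sensitive": lambda m: {"wue_l_per_kwh": wue(m["liters"], m["it_kwh"])},
}

telemetry = {"total_kwh": 1032.0, "it_kwh": 1000.0, "liters": 1800.0}
print(dashboards["efficiency_focused"](telemetry))  # {'pue': 1.032}
print(dashboards["water_sensitive"](telemetry))     # {'wue_l_per_kwh': 1.8}
```

The same underlying telemetry feeds every view; only the derived metric changes, which is what makes the one-platform, many-dashboards approach workable.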
The long-term plan
It’s still early days though, Vildibill cautioned, when asked whether the AI Ops program would be available for data centers that did not include HPE or (HPE-owned) Cray equipment. “That's a very fair question,” he said. “We're really excited about what we're doing. We're onto something big, but it's not a beta of a product. It is still advanced development.
"So the question you ask is exactly the very first question that a product team would and will ask and that is: 'Okay, Vildibill, you guys are on something big. We want to productize it. First question, is it for HPE or is it going to be a product for everybody?’ And I don't think that that decision even gets asked until later in the development process.”
Alphabet’s DeepMind grabbed headlines in 2016 with the announcement that it had cut the PUE of Google’s data centers by 15 percent, and expected to gain further savings. It also said that it would share how it did it with the wider community, but DCD understands the company quietly shelved those plans as the AI program required customized implementations unique to Google’s data centers.
“I can tell you this - and I'm putting pressure on the future product team that's going to have to make these decisions - but everything I'm describing is entirely transferable,” Vildibill said.
“In fact, we envision this being something that could even be picked up by the hyperscalers. It would be very ready for use to manage cloud infrastructure, in addition to being used by our typical customers, both HPC and enterprise, that are running on-premises.
“What I'm driving with this design is to be entirely transferable. I think, if it's not, then you depreciate its value entirely.”