We have an almost mystical faith in the ability of artificial intelligence (AI) to understand and solve problems. It’s being applied across many areas of our daily lives and, as a result, the hardware to enable this is starting to populate our data centers.
Data centers in themselves present an array of complex problems, including optimization and prediction. So, how about using this miracle technology to improve our facilities?
This feature appeared in the July issue of DCD Magazine.
Turning the AI inwards
Machine learning, and especially deep learning, can examine a large set of data, and find patterns within it that do not depend on the model that humans would use to understand and predict that data. It can also predict patterns that will repeat in the future.
Data centers are already well-instrumented, with sensors that provide a lot of real-time and historical data on IT performance and environmental factors. In 2016, Google hit the headlines when it applied AI to that data, in order to improve efficiency.
Google used DeepMind, the AI technology it owns, to optimize the cooling in its data centers. In 2014, the company announced that data center engineer Jim Gao was using the AI tech to implement a recommendation engine.
In 2016, the project optimized cooling at Google's Singapore facility, using a set of neural networks which learned how to predict future temperatures and provide suggestions to respond proactively.
The results shaved 40 percent off the site's cooling bill, and 15 percent off its PUE (power usage effectiveness), according to Richard Evans, a research engineer at DeepMind. In 2016, he promised: “Because the algorithm is a general-purpose framework to understand complex dynamics, we plan to apply this to other challenges in the data center environment and beyond.”
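PUE is the ratio of a facility's total energy use to the energy used by its IT equipment, so a cooling saving only moves PUE in proportion to cooling's share of the overhead. The sketch below works through that arithmetic. All the figures are invented for illustration and are not Google's numbers, and the function name `pue` is our own.

```python
def pue(it_kwh: float, cooling_kwh: float, other_kwh: float) -> float:
    """PUE: total facility energy divided by IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return (it_kwh + cooling_kwh + other_kwh) / it_kwh


# Hypothetical energy figures for one period (assumptions, not measured data).
IT_KWH = 1000.0
COOLING_KWH = 150.0
OTHER_KWH = 50.0  # lighting, power distribution losses and so on

before = pue(IT_KWH, COOLING_KWH, OTHER_KWH)
# Same site with the cooling energy cut by 40 percent.
after = pue(IT_KWH, COOLING_KWH * 0.6, OTHER_KWH)

# The "overhead" is the part of PUE above the ideal value of 1.0.
overhead_cut = ((before - 1.0) - (after - 1.0)) / (before - 1.0)

print(f"PUE before: {before:.2f}, after: {after:.2f}")
print(f"Overhead reduced by {overhead_cut:.0%}")
```

With these made-up inputs, the same 40 percent cooling saving would shrink or grow as a share of the overhead if cooling made up a smaller or larger part of the non-IT load.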
The next step, announced in 2018, was to move closer to a self-driving data center cooling system, where the AI tweaks the data center’s operational settings - under human supervision. To make sure the system operated safely, the team constrained its operation, so the automatic system “only” saves 30 percent on the cooling bill.
The system takes a snapshot of the data center cooling system with thousands of sensors every five minutes, and feeds it into an AI system in the cloud. This predicts how potential actions will affect future energy consumption and picks the best option. This is sent to the data center, verified by the local control system, and then implemented.
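The loop described above (snapshot, predict, pick, verify, apply) can be sketched roughly as follows. This is only an illustration under our own assumptions: `Action`, `toy_predictor` and `toy_safety_check` are hypothetical names, the toy predictor stands in for the neural networks, and the safety limits are invented rather than taken from Google's system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SNAPSHOT_INTERVAL_S = 300  # the article's five-minute cadence

Snapshot = Dict[str, float]  # sensor name -> reading


@dataclass(frozen=True)
class Action:
    """A candidate cooling setting (hypothetical fields)."""
    chilled_water_c: float
    fan_speed_pct: float


def pick_best(snapshot: Snapshot, candidates: List[Action],
              predict_kw: Callable[[Snapshot, Action], float]) -> Action:
    """Cloud side: choose the candidate with the lowest predicted energy use."""
    return min(candidates, key=lambda a: predict_kw(snapshot, a))


def control_step(snapshot: Snapshot, candidates: List[Action],
                 predict_kw: Callable[[Snapshot, Action], float],
                 is_safe: Callable[[Snapshot, Action], bool],
                 current: Action) -> Action:
    """One cycle: propose the best action, apply it only if the local check passes."""
    proposal = pick_best(snapshot, candidates, predict_kw)
    return proposal if is_safe(snapshot, proposal) else current


def toy_predictor(snapshot: Snapshot, action: Action) -> float:
    """Stand-in model: colder water and faster fans cost more energy."""
    chiller = max(0.0, snapshot["outside_c"] - action.chilled_water_c) * 5.0
    fans = action.fan_speed_pct * 0.5
    return chiller + fans


def toy_safety_check(snapshot: Snapshot, action: Action) -> bool:
    """Stand-in local rules with invented operating limits."""
    return (5.0 <= action.chilled_water_c <= 18.0
            and 20.0 <= action.fan_speed_pct <= 100.0)


snapshot = {"outside_c": 20.0}
current = Action(chilled_water_c=7.0, fan_speed_pct=80.0)
moderate = Action(chilled_water_c=12.0, fan_speed_pct=60.0)
too_warm = Action(chilled_water_c=25.0, fan_speed_pct=30.0)

accepted = control_step(snapshot, [current, moderate],
                        toy_predictor, toy_safety_check, current)
rejected = control_step(snapshot, [current, moderate, too_warm],
                        toy_predictor, toy_safety_check, current)
print(accepted)  # the moderate setting passes the local check
print(rejected)  # the cheapest proposal is unsafe, so current settings stay
```

Keeping the safety check separate from the predictor mirrors the article's point that the local control system has the final say. In this sketch a rejected proposal simply leaves the current settings in place.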
The project team reported that the system had started to produce optimizations that were unexpected. Dan Fuenffinger, one of Google’s data center operators who has worked extensively alongside the system, remarked: “It was amazing to see the AI learn to take advantage of winter conditions and produce colder than normal water, which reduces the energy required for cooling within the data center. Rules don’t get better over time, but AI does.”
According to Gao, the big win here was proving that the system operates safely, as well as efficiently. Decisions are vetted against safety rules, and human operators can take over at any time.
At this stage, Google’s AI optimization has one customer: Google itself. But the idea has strong backing from academia.
Stability matters
Humans and simple rule-based systems can respond to any steady-state situation, but when the environment changes, they react in a “choppy” way. AI can do better because it is able to predict changes, according to DCD keynote speaker Suvojit Ghosh, who heads up the Computing Infrastructure Research Centre (CIRC) at Ontario’s McMaster University.
“We know it's bad to run servers too hot,” said Ghosh. “But it's apparently even worse if you have temperature fluctuations.” Simple rules take the data center quickly to the best steady-state position, but in the process they make sudden step changes in temperature, and it turns out that this wastes a lot of energy. If the conditions change often, then these energy losses can cancel out the gains.
“If you have an environment that goes from 70°F to 80°F (21-27°C) and back down, that really hurts,” said Ghosh.
Companies in data center services are responding. Data center infrastructure management (DCIM) firms have added intelligence, and those already doing predictive analytics have added machine learning.
“The current machine learning aspects are at the initial data processing stage of the platform where raw data from sensors and meters is normalized, cleaned, validated and labeled prior to being fed into the predictive modeling engine,” said Zahl Limbuwala, co-founder of Romonet, an analytics company now owned by real estate firm CBRE.
The move for intelligence in power and cooling goes by different names. In China, Huawei’s bid to make power, cooling and DCIM smarter goes under the codenames iPower, iCooling and iManager.
Like Google and others, Huawei is starting with simple practical steps, like using pattern matching to control temperature and spot evidence of refrigerant leaks. In power systems, it’s working to identify and isolate faults using AI.
In its Langfang data center, with 1,540 racks, Huawei has reduced PUE substantially using iCooling, according to senior marketing manager Zou Xiaoteng. The facility operates at around 6kW per rack with a 43 percent IT load rate.
DCIM vendor Nlyte nailed its colors firmly to the AI mast in 2018, when it signed up to integrate its tools with one of the world’s highest profile AI projects, IBM’s Watson.
Launching the partnership at DCD>New York that year, Nlyte CEO Doug Sabella predicted that AI-enhanced DCIM would lead to great things: “The simple things are around preventive maintenance,” he told DCD. “But moving beyond predictive things, you’re really getting into workloads, and managing workloads. Think about it in terms of application performance management: today, you select where you’re going to place a workload based on a finite set of data. Do I put it in the public cloud, or in my private cloud? What are the attributes that help determine the location and infrastructure?
“There’s a whole set of critical information that’s not included in that determination, but from an AI standpoint, you can contribute into it to actually reduce your workloads and optimize your workloads and lower the risk of workload failure. There’s a whole set of AI play here that we see and our partner sees, that we’re working with on this, that is going to have a big impact.”
Amy Benett, North American marketing lead for IBM Watson IoT, saw another practical side: “Behold, a new member of the data center team, one that never takes a vacation or your lunch from the breakroom.”
DCD understands the partnership continues. The Watson brand has been somewhat tarnished by reports that it is not delivering as promised in more demanding areas such as healthcare. It's possible that this early brand leader has been oversold, but if so, data centers could be an arena to restore its good name. The vital systems of a data center are much simpler than those of the human body.
The next stage
It's time for AI to reach for bigger problems, says Ghosh, echoing Sabella's point. After the initial hiccups, efforts to improve power and cooling efficiency will eventually reach a point of diminishing returns. At that point, AI can start moving the IT loads themselves:
“Using the cost of compute history to do intelligent load balancing or container orchestration, you can bring down the energy cost of a particular application,” Ghosh told his DCD audience. This could potentially save half the IT energy cost, “just by reshuffling the jobs [with AI] - and this does not take into account turning idle servers off or anything crazy like that.”
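A very simple form of energy-aware job placement can be sketched as below. It assumes each server has a known energy cost per unit of compute, learned from past measurements, and greedily puts each job on the cheapest server that still has room. This is our own illustration, not the algorithm Ghosh described, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Server:
    name: str
    capacity: float         # compute units the server can host
    joules_per_unit: float  # assumed to come from historical measurements
    used: float = 0.0


def place_jobs(jobs: Dict[str, float], servers: List[Server]) -> Dict[str, str]:
    """Greedy placement: largest job first, onto the cheapest server with room."""
    placement: Dict[str, str] = {}
    ordered = sorted(jobs.items(), key=lambda kv: kv[1], reverse=True)
    for job, demand in ordered:
        fitting = [s for s in servers if s.capacity - s.used >= demand]
        if not fitting:
            raise RuntimeError(f"no server has room for job {job!r}")
        best = min(fitting, key=lambda s: s.joules_per_unit)
        best.used += demand
        placement[job] = best.name
    return placement


def total_energy(jobs: Dict[str, float], placement: Dict[str, str],
                 servers: List[Server]) -> float:
    """Energy for all jobs, given where each one was placed."""
    cost = {s.name: s.joules_per_unit for s in servers}
    return sum(demand * cost[placement[job]] for job, demand in jobs.items())


# Invented fleet and workload, for illustration only.
servers = [
    Server("efficient", capacity=10.0, joules_per_unit=1.0),
    Server("average", capacity=10.0, joules_per_unit=1.5),
    Server("old", capacity=10.0, joules_per_unit=3.0),
]
jobs = {"a": 6.0, "b": 4.0, "c": 5.0}

placement = place_jobs(jobs, servers)
energy = total_energy(jobs, placement, servers)
print(placement, energy)
```

A greedy pass like this is not guaranteed to find the best placement. A production scheduler would also weigh latency, memory and failure domains, which this sketch ignores.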
Beyond that, Ghosh is working on AI analysis of the sounds in a data center. “Experienced people can tell you something is wrong, because it sounds funny,” he said. CIRC has been creating sound profiles of data centers, and relating them to power consumption.
Huawei is doing this too: “If there is a problem in a transformer, the pattern of noise changes,” said Zou Xiaoteng. “By learning the noise pattern of the transformer, we can use the acoustic technology to monitor the status of the transformer.”
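One simple way to approximate "learning the noise pattern" is to record a baseline frequency spectrum from healthy equipment and flag recordings whose spectrum drifts away from it. The sketch below does this with a small pure-Python Fourier transform on synthetic signals. It is an assumption-laden illustration, not the method used by Huawei or CIRC, and the threshold and test tones are invented.

```python
import cmath
import math
from typing import List, Tuple


def spectrum(samples: List[float]) -> List[float]:
    """Magnitudes of the first N/2 DFT bins, scaled to unit length."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc))
    norm = math.sqrt(sum(m * m for m in mags)) or 1.0
    return [m / norm for m in mags]


def distance(a: List[float], b: List[float]) -> float:
    """Cosine distance between two unit-length spectra (0 means same shape)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def tone(components: List[Tuple[int, float]], n: int = 64) -> List[float]:
    """Synthetic recording: a sum of sine waves given as (frequency bin, amplitude)."""
    return [sum(amp * math.sin(2 * math.pi * f * t / n) for f, amp in components)
            for t in range(n)]


THRESHOLD = 0.05  # invented; a real system would tune this on recorded data

baseline = spectrum(tone([(4, 1.0)]))     # "healthy" hum
quieter = tone([(4, 0.9)])                # same pattern, lower volume
faulty = tone([(4, 1.0), (11, 0.8)])      # an extra tone appears


def is_anomalous(samples: List[float]) -> bool:
    """Flag a recording whose spectral shape differs from the baseline."""
    return distance(baseline, spectrum(samples)) > THRESHOLD


print(is_anomalous(quieter), is_anomalous(faulty))
```

Scaling each spectrum to unit length means a change in loudness alone is not flagged, while a change in the mix of frequencies is.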
This sort of approach allows AI to extend beyond expert human knowledge and pick up “things that human cognition can never understand,” said Ghosh.
“In the next 10 years, we will be able to predict failures before they happen,” said Ghosh. “One of my dreams is to create an algorithm that will completely eliminate the need for preventative maintenance.”
Huawei’s Xiaoteng reckons there are less-tangible benefits too: AI can improve resource utilization by around 20 percent, he told DCD, while reducing human error.
Xiaoteng sees AI climbing a ladder from level zero, the completely manual data center. “On level one the basic function is to visualize the contents of the data center with sensors, and on level two, we have some assistance, and partially unattended operation,” where the data center will report conditions to the engineer, who will respond appropriately.
At level three, the data center begins to offer its own root cause analysis and virtual help to solve problems, he said. Huawei has reached this stage, he added: “In the future, I believe we can use AI to predict if there's any problem and use the AI to self-recover the data center.”
At this stage, DCIM systems may even benefit from specialized AI processors, he predicted. Huawei is already experimenting with using its Ascend series AI processors to work in partnership with its DCIM on both cloud and edge sides.
Right now, most users are still at the early stages compared with these ideas, but some clearly share this optimism: “Today we use [AI] for monitoring set points,” said Eric Fussenegger, a mission critical facility site manager at Wells Fargo, speaking at DCD>New York in 2019. AI is adding to DCIM, he said, and “enhancing the single pane of glass.”
AI could get physical further in the future, said Fussenegger, in a fascinating aside. “The ink is not even dry yet, maybe it hasn't even hit the paper,” he said, but intelligent devices could play a role in the day-to-day physical maintenance and operation of a data center.
One day, robots could take over “cleaning or racking equipment for us, so I don’t have to worry about personnel being in hot and cold aisle areas. There are grocery stores that are using AI to sweep.”
Even these extreme views are tempered, however. Said Fussenegger: “I think we’re always going to need humans in there as a backup.”