There is a very good reason why artificial intelligence (AI) has dominated the conversation in the data center industry for the past couple of years. The potential market value of artificial intelligence-generated content (AIGC) is almost incomprehensible in scale. This presents challenges as well as opportunities, as the move from conventional to AI-powered content demands a massive upgrade to the world’s data center infrastructure.

Following the release of its latest report, “Top 10 construction principles for intelligent computing center facilities,” DCD spoke to Bob He, vice president of Huawei Digital Power and president of Huawei’s data center facility and critical power product line, to explore the potential problems and solutions for the construction of data centers in this brave new world.

Data center construction
– Getty Images

Industry challenges

He cites four main challenges that the data center industry faces in the AI era. The first is security which, as He points out, is inextricably linked to the value of the hardware itself: an AI server can cost up to four times as much as a conventional one. This raises the stakes significantly when it comes to protecting systems from bad actors.

As data center infrastructure moves increasingly toward clusters, a single faulty node can have a direct impact on the rest of the cluster. As He explains: “A fault in a process may affect the entire cluster system. Therefore, security is the first and biggest challenge in the intelligent computing era.”

Second on He’s list is the massive increase in power consumption, which can reach 50-100kW for a single cabinet. Not only must these high-powered arrays have a constant electricity supply in an already congested market, but operators must also navigate a significantly reduced response time during a system failure.

He explains: “As the power of a single cabinet increases to 100kW, the fault response time of devices is shortened to 10-30 seconds before failure. In traditional data centers with around 5kW per cabinet, if the cooling system fails, it takes five to ten minutes for the IT system to break down due to overheating. However, in intelligent computing centers with 50kW per cabinet, a cooling system failure leads to overheating and breakdown in just 10-30 seconds, giving Operations and Maintenance (O&M) personnel no time to respond.”
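
A rough back-of-envelope model shows why the window shrinks so dramatically. The sketch below is our own illustration, not Huawei’s math: it assumes a fixed effective thermal mass per cabinet and a fixed allowable temperature rise, both of which vary widely in practice with airflow and containment design.

```python
# Back-of-envelope: seconds until a cabinet overheats after total
# cooling loss. The thermal mass and allowable temperature rise are
# illustrative assumptions, not figures from Huawei's report.

def seconds_to_overheat(power_kw, thermal_mass_kj_per_c=300.0,
                        allowed_rise_c=15.0):
    """Time until the allowed rise is consumed, assuming all heat
    goes into the cabinet's local thermal mass (1 kW = 1 kJ/s)."""
    return thermal_mass_kj_per_c * allowed_rise_c / power_kw

for kw in (5, 50, 100):
    print(f"{kw:>3} kW cabinet: ~{seconds_to_overheat(kw):.0f} s to overheat")
# 5 kW -> ~900 s (about 15 min); 50 kW -> ~90 s; 100 kW -> ~45 s:
# the same order of magnitude as the response windows He describes.
```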

The extra power demanded by AI also affects the data center footprint, with significantly more space required for power distribution than may have been allotted in the past.

The third issue also relates to power – specifically, the increased energy consumption of AI compute. He says: “The power consumption of a mega cluster computing data center will reach gigawatt level, similar to the capacity of a 220kV substation, which is, in turn, equivalent to the power consumption of a small city with a population of 400,000. This will be extremely difficult no matter the amount of investment and level of approval.”
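
The city comparison holds up under rough arithmetic. The sketch below is our own back-of-envelope check, assuming a 1GW campus running continuously; none of the figures come from the report.

```python
# Our own back-of-envelope check of the "small city" comparison.
# Assumes a 1 GW campus running continuously; figures are illustrative.

cluster_power_gw = 1.0
hours_per_year = 8760
population = 400_000

energy_twh = cluster_power_gw * hours_per_year / 1000          # TWh/year
per_capita_kwh = cluster_power_gw * 1e6 * hours_per_year / population

print(f"Annual energy: {energy_twh:.2f} TWh")                  # ~8.76 TWh
print(f"Per capita: {per_capita_kwh:,.0f} kWh/person/year")    # ~21,900
# ~21,900 kWh/person/year is an average draw of ~2.5 kW per resident,
# plausible for a small city once industry and commerce are included.
```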

For this reason, He emphasizes the importance of green, efficient power provision in data centers, citing small changes that add up. For a 500MW data center – huge by today’s standards, but a scale Huawei believes will arrive sooner than we think – reducing power usage effectiveness (PUE) by 0.1 can yield cost savings of around ¥200m ($28m) annually. With few data centers coming even close to the “ideal” PUE of 1.0, these incremental changes can be transformational for power usage, sustainability, and the company’s bottom line.
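
The arithmetic behind that figure is easy to sketch. Below is a minimal back-of-envelope calculation, assuming the 500MW figure refers to IT load and an illustrative electricity price of ¥0.5/kWh; neither assumption comes from Huawei’s report.

```python
# Sketch of the PUE saving arithmetic. Assumptions (not from the
# report): the 500 MW figure is IT load, and electricity costs
# 0.5 CNY/kWh.

it_load_mw = 500
pue_delta = 0.1            # facility power = IT load * PUE
price_cny_per_kwh = 0.5
hours_per_year = 8760

saved_mwh = it_load_mw * pue_delta * hours_per_year
saved_cny = saved_mwh * 1000 * price_cny_per_kwh

print(f"Energy saved: {saved_mwh:,.0f} MWh/year")    # 438,000 MWh
print(f"Cost saved: ~¥{saved_cny / 1e6:.0f}m/year")  # ~¥219m, in line
                                                     # with the ~¥200m cited
```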

The last of He’s four challenges is a more ethereal one – that of uncertainty. He explains: “With the acceleration of chip iteration, endless possibilities are brought to AI computing, but huge uncertainties are also brought to infrastructure construction.

“Over the past few years, AI chips have moved from a new generation every three years to a new generation every year, each with higher power density. This is a huge challenge for us because we cannot be sure whether a newly built data center will be able to accommodate the AI devices of the next two to three years.”

To illustrate his point, He points to Metcalfe’s Law, which states that the value of a communications network is proportional to the square of the number of connected users. On this basis, He argues, “Quickly seizing users has become the key to winning AI services.”
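
For readers who want the model made concrete, here is a one-line rendering of Metcalfe’s Law; the constant k is an arbitrary scaling factor of our own choosing.

```python
# Metcalfe's Law in one line: network value grows with the square of
# the user count, which is why early user acquisition compounds.
value = lambda users, k=1.0: k * users ** 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} users -> relative value {value(n):.0e}")
# A 10x larger user base is ~100x more valuable under this model.
```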

In turn, the key will be ensuring that data center infrastructure expands fast enough to meet demand as it arrives, rather than leaving customers on a waitlist for capacity. This puts pressure on data center constructors to prioritize fast delivery, with lead times shrinking from as much as 24 months to as little as six. The challenges of securing materials, labor, permitting, and the many other aspects of taking a facility from blueprint to buildout mean these increasingly short timescales put additional pressure on the industry.

Enabling AI

For He, there are three key elements to consider when looking at AI-enabled data centers: reliability, which he describes as a “core competency” of data centers; flexibility, referring to facility architecture and the move toward modular design; and sustainability, the hot-button issue affecting decisions made both inside and outside the data center.

Based on these three factors, Huawei’s recent report “Top 10 construction principles for intelligent computing center facilities” lays out the company’s vision for best practices in data center construction for the AI era. He discusses the ten factors, which can be summarized as follows:

Reliability

  1. Isolated energy storage: “For lithium batteries in data centers, Huawei proposes isolated deployment of energy storage systems, which can be deployed in a remote outdoor area so electrochemical energy storage is isolated from IT services. This maximizes safety and mitigates additional costs for fire extinguishing and emergency ventilation compliance.”
  2. Distributed architecture: “Electromechanical equipment should be deployed in ‘distributed mode’ to minimize fault domains. One power or cooling system can be deployed in a single container to prevent fault spreading and eliminate the impact on services.”
  3. Continuous cooling: “To cool high-density equipment, data centers should provide continuous cooling capability to prevent service interruption in case of abnormalities.”
  4. Highly reliable products: “Product quality should be managed comprehensively throughout design, materials used, production, testing, networks, and processes, delivering high-quality products to ensure end-to-end safety.”
  5. Professional services: “Professional and proactive services from deployment to maintenance eliminate risks and ensure that systems run reliably over the long term.”
  6. Smart management: “AI technologies can be applied in data centers to prevent faults such as power failure, fire, and overheating. AI will become a must-have technology for intelligent O&M.”
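
To make item 6 slightly more concrete, here is a deliberately minimal sketch of the kind of telemetry anomaly detection that underpins smart O&M – flagging a cabinet temperature reading that deviates sharply from its recent baseline. Real predictive-maintenance systems are far more sophisticated, and nothing here reflects Huawei’s actual implementation.

```python
# Minimal sketch: flag cabinet temperature readings that deviate
# sharply from recent history. Illustrative only; not Huawei's system.

from collections import deque
from statistics import mean, stdev

def make_detector(window=60, threshold=3.0):
    """Return a checker that flags readings more than `threshold`
    standard deviations from the rolling window's mean."""
    history = deque(maxlen=window)

    def check(reading_c):
        anomalous = False
        if len(history) >= 10:  # need a baseline first
            mu, sigma = mean(history), stdev(history)
            anomalous = sigma > 0 and abs(reading_c - mu) / sigma > threshold
        history.append(reading_c)
        return anomalous

    return check

check = make_detector()
for t in [24.1, 24.3, 24.0, 24.2] * 5 + [31.5]:  # sudden spike at the end
    if check(t):
        print(f"ALERT: {t} degC deviates from recent baseline")
```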

Flexibility

  1. Subsystem decoupling: “To address uncertainty in the intelligent computing era, subsystem decoupling can be adopted to meet basic requirements for flexible data center evolution. Key subsystems can be decoupled to implement on-demand construction.”
  2. Flexible architecture: “Data center standardization, modular design, and prefabricated modules enhance architecture flexibility, and therefore play a key role in flexible evolution, smooth capacity expansion, and quick delivery of intelligent computing centers.”

Sustainability

  1. High density and efficiency: “High-density deployment and high-efficiency design save space and energy for electromechanical equipment in intelligent computing centers. High density and efficiency will be important features of green and low-carbon intelligent computing products.”
  2. Air-liquid convergence: “Liquid cooling is the key to solving the heat dissipation problem for intelligent computing centers with high power density. Air-liquid convergence and an adjustable air-liquid ratio can reduce the total demand for cooling capacity, simplify delivery, and improve the energy-saving performance of systems.”
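
The adjustable air-liquid ratio is easiest to see as a simple heat budget: whatever heat the liquid loop does not capture must be removed by air. The sketch below is our own simplification; the report does not specify the mechanism at this level.

```python
# Illustrative heat budget for an adjustable air-liquid cooling split
# (our own simplification, not Huawei's design).

def cooling_split(cabinet_kw, liquid_fraction):
    """Return (liquid_kw, air_kw) for a given liquid-cooling fraction."""
    liquid_kw = cabinet_kw * liquid_fraction
    return liquid_kw, cabinet_kw - liquid_kw

for frac in (0.0, 0.7, 0.9):
    liq, air = cooling_split(100, frac)
    print(f"liquid ratio {frac:.0%}: {liq:.0f} kW liquid, {air:.0f} kW air")
# As the liquid ratio rises, the air-side plant (and its energy use)
# can shrink, which is the efficiency argument for convergence.
```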

Best practice for batteries

With ongoing concerns about the safety of lithium batteries, we ask He what Huawei recommends as best practice for deploying such arrays. He tells us: “Huawei proposes the principle of isolated energy storage deployment. The preferred option is outdoor remote deployment, which separates electrochemical energy storage from IT services, maximizing data center safety.” He goes on to note that while indoor deployment is possible, it brings the extra mitigation requirements outlined above.

Generic Data Center Hall
– Getty Images

Huawei offers a modular solution for lithium battery deployment, which He outlines for us: “Huawei’s FusionPower9000 adopts a fully prefabricated modular design, integrating the PowerPOD and lithium batteries into one container. It supports one power system per container, with outdoor deployment that does not occupy data center space and ensures the safe use of lithium batteries in data centers.”

Huawei’s FusionPower9000 offers several advantages over traditional power supply solutions, beyond the obvious environmental benefits. Because the units are prefabricated and pretested before dispatch, they can be installed quickly, bringing lead times down from an average of 28 weeks to just 18.

Thanks to its “decoupled architecture,” the system can expand externally as needs dictate, without requiring additional data center footprint. Huawei promises a high level of testing and construction quality in a high-protection container, to ensure reliability throughout the system’s lifespan.

Data centers are cool

Finally, we address the issue of cooling, which has become central to the debate surrounding data centers in the AI era. Because AI workloads generate large amounts of heat in the servers, extra cooling must be deployed, which, particularly in the case of liquid cooling, can take up additional footprint and draw yet more power.

He also reminds us of another challenge to consider – that of reduced response time. He tells us: “When the cooling system is faulty, we have only 30 seconds or even 10 seconds response time. Traditional architecture cannot meet O&M requirements, so a continuous cooling architecture is a must in high-density scenarios.”

This challenge is compounded by the fact that further delays can be caused if a compressor fails. He explains: “Core temperature control components, such as the compressor and fan, are moving components with high speed or high pressure. It takes a long time to restart the components after shutdown.”

This makes it vital that any adopted system offers quick recovery in the event of abnormal scenarios, such as natural disasters, leaking pipes, or even cyberattacks. “In this case, we need to quickly restore the cooling system to minimize the loss. Escape mechanisms, such as one-click maximum cooling output, fast restart after device interruption, and fast liquid refilling of the liquid cooling system, are especially important,” He remarks.

As for Huawei’s solution to these issues: “Cooling in intelligent computing centers is a major challenge. Liquid cooling is being rapidly adopted, but there are uncertainties in intelligent computing center operations, such as the ratio of liquid-cooled to air-cooled servers. To address these uncertainties, Huawei has developed a hybrid cooling solution that combines air and liquid cooling, which not only flexibly meets the cooling requirements of different servers but also saves energy.”
