In the world of mass-market consumer cloud services there are two main stories. For those whose job is brand protection and promotion, the story is renewable energy and making the data center appear green.

For those whose job is to design and operate data centers, the story is about efficiency and failure – that is, driving efficiency and avoiding downtime.

David Gauthier says there are two choices.

“We can keep doing the data center as we have but my concern is the sheer complexity of traditional data centers. Complexity requires more maintenance. And more often than not it is the maintenance that brings down the data center,” Gauthier says.

The second choice is moving the control away from the physical layer.

“If I look at a world where the abstraction layer is the controller software it is good at dealing with complex systems. Software is binary and good at managing a requirement. Either it is on or off,” he says.

“At a hardware level we have reduced the number of SKUs (stock keeping units) to remove complexity. Once I do that I can think about fungibility. Once I can make the server redundancy less about the hardware I can think about a rack or row instead of individual servers.

“The intelligence is in the software and I have an interface between the behaviour of the facilities and the systems the applications are running on. This computerized maintenance system is up and the software can drain a cluster in the event of a problem.”
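Gauthier does not spell out the interface between facility events and the cluster-management layer, but the behaviour he describes can be sketched. Below is a minimal illustration, assuming a hypothetical facility-event feed and a cluster API with cordon and drain operations (all names are invented for the example):

```python
# Sketch only: controller software reacting to a facility problem by
# draining the affected cluster, in the spirit of the behaviour described.
# FacilityEvent, ClusterManager and their methods are hypothetical.
from dataclasses import dataclass

@dataclass
class FacilityEvent:
    source: str            # e.g. "PDU-7" or "UPS-2"
    severity: str          # "warning" or "critical"
    affected_cluster: str  # cluster mapped to that piece of plant

class ClusterManager:
    """Stand-in for whatever cluster-management API the controller drives."""
    def cordon(self, cluster: str) -> None:
        print(f"cordon {cluster}: stop scheduling new work here")

    def drain(self, cluster: str) -> None:
        print(f"drain {cluster}: move or restart workloads on healthy capacity")

def handle_event(event: FacilityEvent, clusters: ClusterManager) -> None:
    # The software decides and acts; operators are informed, not dispatched.
    if event.severity == "critical":
        clusters.cordon(event.affected_cluster)
        clusters.drain(event.affected_cluster)

handle_event(FacilityEvent("PDU-7", "critical", "row-12"), ClusterManager())
```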

This raises the question of trust in software for mission-critical operations.

“There is confidence being built in these systems. Hardware is always going to fail but software can continue to improve,” Gauthier says.

“Agility is the key because once you’ve built the data center, if there is a latent defect, you can’t change it if, for example, there is a logic error in a PDU.”

THE VALUE IS IN THE SIMPLICITY

Gauthier says that the individual equipment items are themselves becoming simpler. “Inside those servers I can get rid of the dual power packs and all the unnecessary components. Top-of-rack switches are being deployed inside Microsoft’s data centers,” he says.

Microsoft has two ways of ordering servers. The first is RFQs for thousands of standard servers. The second is more specialist. 


Gauthier describes the data center control system thus: “Document and understand the data center, document the current state. The way we’re constructing out the data center is to make it automated. This is extremely important for people to understand. You need to be able to automatically map your data center so you understand what resides where.”

“You need to understand what your failure domains are – and align those. Then figure out how to deploy apps against those failure domains.

“We’ve been running this type of system for a number of years – we had a dual utility failover at a data center. The software did its job and no-one knew the system was down. In our case [our product] is a home-grown system. We’d had a form of DCIM, which was an asset management and ticketing system. It is an extension of existing systems. A totally new system would be operationally challenging.”
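The steps Gauthier lists (map the data center automatically, identify failure domains, then deploy applications against them) lend themselves to a simple illustration. The sketch below is not Microsoft’s system; it assumes a hypothetical inventory that records which power or cooling failure domain each rack depends on, and spreads an application’s replicas so that no single domain failure takes them all out:

```python
# Sketch: place an app's replicas across failure domains derived from an
# automatically gathered rack inventory. All data here is invented.
from collections import defaultdict
from itertools import cycle

# rack -> failure domain (e.g. the utility feed / UPS the rack depends on)
rack_to_domain = {
    "rack-01": "feed-A", "rack-02": "feed-A",
    "rack-03": "feed-B", "rack-04": "feed-B",
    "rack-05": "feed-C",
}

def place_replicas(app: str, replicas: int) -> dict:
    """Assign replicas to racks, rotating across failure domains first."""
    racks_by_domain = defaultdict(list)
    for rack, domain in rack_to_domain.items():
        racks_by_domain[domain].append(rack)

    placement = {}
    domain_order = cycle(sorted(racks_by_domain))
    used_in_domain = defaultdict(int)
    for i in range(replicas):
        domain = next(domain_order)
        racks = racks_by_domain[domain]
        placement[f"{app}-{i}"] = racks[used_in_domain[domain] % len(racks)]
        used_in_domain[domain] += 1
    return placement

# No single feed failure takes down every replica of the app.
print(place_replicas("web-frontend", 4))
```

A real placement system would also weigh capacity and network locality, but the principle of aligning software placement with physical failure domains is the one Gauthier describes.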

What is clear is that Microsoft is running its cloud from operations centers around the globe, and that the system it is running is only relevant to cloud-scale operations running hundreds of thousands of servers.

“When operating at cloud scale, you can expect that at any one time between 1% and 2% of your servers will be in ‘fail’ state,” Gauthier says.
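That percentage is easier to grasp as absolute numbers. A quick back-of-the-envelope calculation, using an illustrative fleet size rather than a figure Microsoft has disclosed:

```python
# Back-of-the-envelope: what a 1-2% 'fail' state means at cloud scale.
# The fleet size below is an illustrative assumption, not a Microsoft figure.
fleet_size = 200_000  # "hundreds of thousands of servers"

for fail_rate in (0.01, 0.02):
    failed = int(fleet_size * fail_rate)
    print(f"{fail_rate:.0%} of {fleet_size:,} servers -> {failed:,} machines down at any time")
```

At that volume, individual server failures stop being incidents and become a steady workload the software has to route around.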


DUBLIN OPERATION

Microsoft in Dublin is on its first refresh of IT hardware or, as it is expressed: “It is decommissioning at a kW scale and re-commissioning at a MW scale.”

In Dublin the facility runs at a 24°C target inlet temperature with a 10-30% range. It was the world’s first large-scale air-side economizer facility to open. In the first halls that Microsoft built, it installed fans for inlet and outlet air. For the latest 132,000 sq ft phase the air is pushed by roof-mounted air handlers and returned on positive pressure.

Microsoft is not opting for a high-density topology. It believes low density suits its need for simplicity. Of course, low density means more connectivity, and Microsoft says the cost of the ‘glass’ for fiber connectivity is now a significant line item.

This density is also relative. The latest ‘colo hall’ packs the same compute density into half the space: the rows are closer together and the aisles are narrower. This layout is being copied at three other locations, according to Gauthier. The topology is dictated by five-nines availability and the cost of service.

Just as Christian Belady did, Gauthier says that while Microsoft measures PUE, it is less concerned with its metrics numbers than with its capital efficiency.

And efficient use of capital is built on homogeneity.

 

This is the second part of an article on Microsoft’s build-out strategy. The article first appeared in DatacenterDynamics FOCUS magazine, May/June 2013.