The construction of the world’s digital infrastructure has been a uniquely collaborative affair, with governments, research institutions and corporations all playing their part in the creation of a monumental web of data centers, cables, towers, satellites and sensors.

But there are a few companies whose contributions to the whole have been unrivaled, firms that built networks responsible for a vast portion of digital traffic, and which are spending billions to extend their dominance even further.

“Fifteen years ago when I started at Google I didn’t imagine that we would be building the world’s largest network, or the world’s largest compute infrastructure,” Benjamin Treynor Sloss told DCD. As vice president of 24/7, he is responsible for keeping Google online - all of it, from Search to Maps to Cloud Platform.

“If Google ever stops working, it’s my fault,” he said. “That’s my job. You know, one year at a time.”

Land of the giants – Chris Perrins/DCD

The growth challenge

When Sloss joined the business, he couldn’t predict how large the company and its infrastructure requirements would become. “I just knew that we had a set of things that we needed to do in the next three months and I could extrapolate out with a great deal of confidence for the next two to three years.”

At Microsoft, the experience was similar for David Gauthier, the company’s senior director of data center strategy and architecture. “I’ve been at Microsoft about 19 years, and I’ve been involved in our data center infrastructure that whole time,” he told DCD.

“It’s been quite a journey - coming up from the early days of Microsoft, with MSN, and then the original push into algorithmic search with Bing. We thought we were hyperscale back then: I don’t think we had any real grasp of what was coming. This thing has just taken off in a way that is really unique to any industry.”

Google, too, has had to deal with extraordinary growth, further exacerbated by its entry into the cloud services market. “As we started to offer a public cloud product we used the same data centers and really the same infrastructure, the same network, same servers, the same everything, that we were already using,” Sloss said.

His goal, now, is to enable Google Cloud Platform customers “to build a service with the same availability and the same performance and the same feature richness as Google Search or Gmail.”

Handling this challenge requires a careful balancing of ideas, roadmaps and priorities, Sloss said, likening his job to that of a portfolio manager. “I’ve got 5,000 people in my team, and in each area I’ve got some people working on things that are not needed in the next three to six months.”

A lot of Google staff are working on iterative improvements and things that will “eventually become forced moves,” while “a fair fraction” are focused on “larger leaps that have a lower probability of success,” Sloss said. “The two halves have to go hand in hand.”

Employees are encouraged to follow Google’s 70-20-10 philosophy (70 percent on the core business, 20 on core-related projects and 10 on unrelated projects). “We invest in a number of those [further out] efforts each quarter in order to get the few that actually do pan out, to turn into projects that can move the needle quite significantly.

“For example, using machine learning to make huge power efficiency gains - the person who proposed that was from one of the iterative teams,” Sloss said. In 2016, Google’s DeepMind division announced it had achieved a 15 percent improvement in power usage effectiveness (PUE) at one of the company’s data centers.
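(PUE is the ratio of a facility’s total energy draw to the energy actually delivered to IT equipment, so a perfect score is 1.0.) As a rough sketch of the arithmetic - with invented figures, not Google’s data, and reading the 15 percent as a cut to the overhead above 1.0 - the saving looks like this:

```python
# Illustrative only: PUE = total facility energy / IT equipment energy.
it_load_kwh = 10_000     # hypothetical energy delivered to servers, storage, network
overhead_kwh = 1_200     # hypothetical cooling, power conversion and other overhead

pue_before = (it_load_kwh + overhead_kwh) / it_load_kwh       # 1.12

# A 15 percent cut to the overhead component (everything above 1.0)
overhead_after = overhead_kwh * 0.85
pue_after = (it_load_kwh + overhead_after) / it_load_kwh      # ~1.10

print(f"PUE before: {pue_before:.3f}, after: {pue_after:.3f}")
```

Small as the change in the headline number looks, at the scale of a hyperscale fleet it translates into very large absolute energy savings.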

Details are limited on how widely Google has adopted the algorithm: “I would say this: It is being rolled out, and we will continue to roll it out as we build or retrofit new data centers. But if you were to look at the majority of data center capacity that we have at this point, they’re already benefiting from it,” Sloss said.

As for Microsoft, Gauthier said: “We’re all using AI and machine learning to optimize our infrastructure, bringing down the energy consumption and the water and other resource consumption.”

Keynote at re:Invent 2017 – AWS

The capacity question

Machine learning has also been enlisted to help with capacity planning. Last year, Amazon Web Services’ CEO Andy Jassy said that “one of the least understood aspects of AWS is that it’s a giant logistics challenge, it’s a really hard business to operate.”

“We are, of course, using machine learning in many areas,” AWS technical evangelist Ian Massingham told DCD. “Capacity forecasting is a classic sequence prediction machine learning use case. So why wouldn’t we be doing it? We actually have customers that are doing that as well. Games publisher Electronic Arts is using machine learning for planning its own EC2 capacity fleets so when they launch new games, they’ve got enough capacity ready.”

“They haven’t been terribly specific about what they meant but I can take a pretty educated guess,” Sloss said about AWS’ capacity planning. “Demand planning isn’t just a simple extrapolation of a logarithmic curve. There are actually predictable peaks and troughs,” like how demand for Google Search grows between September and May and then flattens between June and August, when people are not in school.

“So you can see that there are these historical effects. We can plan our capacity, and it becomes important to get those five percent efficiency improvements when you’re talking about billions of dollars of infrastructure. I’m assuming Amazon is roughly doing the same thing. We haven’t said anything about it, but we’ve been doing things like this for more than 15 years.”
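As an illustration of the kind of sequence prediction both companies allude to, the sketch below - a toy model, not any provider’s actual system - fits a trend to hypothetical monthly demand, layers the school-year seasonality Sloss describes on top, and adds a buffer so capacity leads demand:

```python
import numpy as np

# Hypothetical monthly peak demand (normalized compute units) over three years,
# with demand dipping each summer in the way Sloss describes for Search.
demand = np.array([
    100, 104, 108, 111, 113, 105, 104, 106, 118, 123, 127, 130,
    133, 137, 141, 144, 146, 138, 137, 139, 152, 157, 161, 165,
    168, 172, 176, 179, 182, 173, 172, 174, 188, 193, 198, 202,
], dtype=float)
months = np.arange(len(demand))

# Fit a linear trend, then average the residual per calendar month
# to get a simple additive seasonal profile.
slope, intercept = np.polyfit(months, demand, 1)
residual = demand - (slope * months + intercept)
seasonal = np.array([residual[m::12].mean() for m in range(12)])

# Forecast the next six months: extrapolated trend + seasonal component,
# plus headroom so capacity is provisioned ahead of demand.
horizon = np.arange(len(demand), len(demand) + 6)
forecast = slope * horizon + intercept + seasonal[horizon % 12]
capacity_plan = forecast * 1.10   # 10% buffer, an arbitrary illustrative margin

print(np.round(capacity_plan, 1))
```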

Gauthier was equally coy: “I think it would be fair to assume that we could be using that - I can’t confirm it.”

To be certain that there is enough capacity to meet sudden demand increases, every time Microsoft Azure launches a new region, it ensures that its data center locations have space for new data halls, and its utility providers have additional resources. “The last thing we want to do is go open a brand new region and not have options for growth,” Gauthier said.

“I like to say that we maintain the illusion of infinite capacity. That’s really the challenge in cloud computing as a lot of the infrastructure and the hardware is getting mature. How we do capacity planning to maintain that illusion is really where a lot of ‘special sauce’ is today.”

One way AWS keeps some control over sudden capacity shifts is by limiting how many instances a customer can launch without first discussing their plans with the cloud company. “So if you want to go above those account limits, you raise a request form and that’s subject to a quick review of your use case,” Massingham said. “Then we know what the potential consumption footprint is, and we can use that to inform our capacity planning.”
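From the customer side, that workflow looks roughly like the Python sketch below, using boto3. (The EC2 ‘max-instances’ account attribute and the Service Quotas API are the mechanisms available today, not necessarily what Massingham was describing at the time, and the quota code is a placeholder.)

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Read the account's current default instance limit - the older,
# instance-count view of the quota Massingham describes.
attrs = ec2.describe_account_attributes(AttributeNames=["max-instances"])
for attr in attrs["AccountAttributes"]:
    for value in attr["AttributeValues"]:
        print(attr["AttributeName"], "=", value["AttributeValue"])

# Going above the limit means filing a request that AWS can review - the
# demand signal that feeds its capacity planning. Via the Service Quotas API:
quotas = boto3.client("service-quotas", region_name="us-east-1")
quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-XXXXXXXX",   # placeholder for the specific EC2 quota being raised
    DesiredValue=256.0,
)
```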

To let the company overprovision without losing too much money in the process, AWS also operates the EC2 Spot market, a discounted auction-style marketplace where customers bid for spare resources - resources that can be reclaimed if another customer buys them through the standard on-demand model. “What you’re looking at there is our attempt to recover the marginal cost of that, as yet unused, capacity; capacity that has not yet been sold for demand usage or for reserve instances,” Massingham said.
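A minimal sketch of tapping that spare capacity with boto3 follows; the AMI ID and the price ceiling are placeholders, and EC2 reclaims Spot instances with a two-minute warning when the capacity is needed elsewhere:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request a Spot instance: the max price is a ceiling on what we are willing
# to pay for otherwise-idle capacity, which EC2 can take back when it is
# needed for on-demand or reserved customers.
response = ec2.request_spot_instances(
    InstanceCount=1,
    SpotPrice="0.05",                        # placeholder ceiling, USD per hour
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "m5.large",
    },
)

for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```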

The spot market was an initiative requested by the AWS community, a community which Massingham believes gives the cloud company a unique edge.

“We were early, and for some reason other people that may have become competitors didn’t realize the potential impact of cloud computing for quite a few years, so we had a head start,” he said.

That head start has resulted in a massive collection of customer feedback and usage data, “so we have really good insight into what customers find important about the existing services and have a great opportunity to talk with customers about what they want us to add to the platform,” he said.

The next big thing

Microsoft logo – Getty Images

When it comes to cloud market share, AWS might be ahead at the moment, but each company has a huge R&D department trying to find the next big thing, the improvement, feature or innovation that will give it the edge - or at least cut the cost of its internal services.

“We run a number of different programs that span two years to five years and beyond, to try to keep ourselves abreast of where technology is going,” Gauthier said.

“We look at the rest of the things in the data center that are taking time and money and energy to run, from the generators, to the UPSs, to the power distribution, and then see how necessary those really are if you have a well-designed hyperscale system that handles faults in software and handles availability challenges by distributing workloads.”

The company is also trying to escape its reliance on the electric grid, experimenting with hydrogen and methane gas-powered fuel cells at the rack level for the past five years. “You take out all the losses of the grid, take out all the distribution challenges of transformers, and bring them into one extremely efficient package. Our pilot data center is running very well for us, and it allows us to show the proof of possibility to the supplier ecosystem around fuel cells.”

Another advantage, Gauthier said, is the fact that anything that eliminates the need for diesel generators in data centers will make gaining permits for new sites significantly easier. “I can’t give you a timeline for when fuel cells will be in a production data center, but I can say that it’s definitely a top priority for us. It’s a super interesting technology and it’s something we’re sharing with the ecosystem. We have a regular conference where even some of our competitors come and talk about the technology and how we can mature it for the industry.”

A source at the US Department of Energy’s National Renewable Energy Laboratory, which has collaborated with Microsoft on testing fuel cell technology, confirmed to DCD that representatives from Google have expressed an interest in the tech, visiting the government laboratory to learn more.

Another area Microsoft’s R&D is very much engaged in, Gauthier said, is the topic of high density rack cooling, potentially using liquids. With the number of AI and ML workloads growing, “we definitely are seeing density increasing, and in the air-cooled space that is something we’re watching very closely. We maintain a little bit of a trigger point where we start moving in the direction of other cooling technologies.”

Google is also looking into liquid cooling. With its latest generation Tensor Processing Unit, the TPU 3.0, it has turned to this technology for the very first time. “Other things being equal, liquid cooling is more expensive than air cooling because you have more pipes and more copper and more heat exchangers and you have to have a little thing sitting on top of every chip,” Sloss said. “So you don’t do it unless you really need to, but physics requires that you do it because of the power density of these machine learning systems.”

Before adopting TPUs and other internal hardware products, Google usually tries out the equipment among its tens of thousands of staff. “When we first came out with them, let’s face it, they were clunky,” Sloss said. “You had maybe 20 of them and they needed constant service to work. It was not really in a form where you could offer it as a service.”

In cases like this, Google turns to ‘dogfooding,’ the process of using its own employees as a test base. “It’s a large enough user base that you’re going to find all sorts of things that you would find in public, but with a much more forgiving audience. Googlers internally may make a meme when you give them something that doesn’t work well, but you don’t end up with press headlines about it.”

This process, and the focus on innovation, has helped the company stay ahead, with Sloss seeing Google as “the first company to do cloud computing at scale, as we were building this stuff back in 1998. Now we have several companies that are building cloud computing at scale and a lot of the folks who have historically had bespoke systems and bespoke data centers are appreciating that, actually, cloud computing brings a large benefit both in terms of flexibility and in terms of economics.

“So I’m not worried about us becoming stuck with the design that we’ve got - the design that we’ve got now is the thing that everybody’s investing in. But a more interesting question may be: what comes after cloud computing?”

Again, he sees it as a portfolio matter: “How much of Google’s total engineering is going into using the infrastructure that we’ve got today versus using the next generation? I don’t know if I could accurately predict what infrastructure will look like in 15 years. But I will observe that Google appears to be on the forefront of machine learning infrastructure, which is barely different from cloud computing infrastructure. To me, that is an interesting new angle on where computing is going.”

Planning for downtime

Google’s HQ – Sebastian Moss

But innovation requires sacrifice, Sloss warned. He created the concept of Site Reliability Engineering - a discipline that incorporates aspects of software engineering and applies them to IT operations problems - and, in the book of the same name, he notes that product development and SRE teams can suffer from “a structural conflict between pace of innovation and product stability.” This can be resolved “with the introduction of an error budget,” a set percentage of errors and downtime.

With this in mind, as the VP of 24/7, does he have an error budget? “Yes, it’s just very small,” Sloss said. “Google’s availability targets are typically in the five nines range.”

If you consider all the other pieces of non-Google infrastructure involved in making an online search, Sloss said, when a user cannot access Google “it is almost always because of something that has nothing at all to do with us. So you, as a user, can’t actually tell the difference between full availability and five nines of availability. To you it appears identical. But the level of effort and cost that’s required, and the drain on engineering resources and feature velocity that’s required to go from five nines to, say, six nines is actually immense.”
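The arithmetic behind that trade-off is simple enough to spell out: every extra nine shrinks the error budget - the downtime a service is allowed - by a factor of ten. A quick back-of-the-envelope calculation:

```python
# Error budget = 1 - availability target, expressed here as allowed
# downtime per year (365.25 days).
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target in (0.999, 0.9999, 0.99999, 0.999999):   # three to six nines
    budget_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.4%} availability -> {budget_minutes:8.2f} minutes of downtime a year")
```

Five nines leaves a little over five minutes of downtime a year; six nines leaves barely thirty seconds, which is why the last step is so disproportionately expensive.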

Sloss believes it is this realization, that 100 percent is not the right availability target for most services, that is key. “Even if you were 100 percent perfect, actually people’s experience of you is going to be perhaps two and a half nines. Once you’ve got that, then the question is: What availability target is the right balance between making your users extremely happy and being able to deliver them lots of new products at a rapid pace and at very low price point? And then it is just about picking the correct point on that, which is crucially important.”

Achieving high availability across massive systems, while still growing and rolling out new features, has presented difficulties for all of the major cloud companies, each having suffered unplanned outages and downtime.

“We’ve hit scaling challenges within AWS that most providers will never get to,” Massingham said. “We architected systems to address those challenges that most providers have never had to architect yet.

“We have had service incidents in the past, of course, but the one thing you might have noticed is the frequency of these incidents is much, much, much lower than it ever has been historically,” he said.

One of the ways to reduce the number of incidents that Microsoft has found useful is to simplify the data center. “I’ve been in some data centers that are not ours, where some of the maintenance transfer procedures are 75 or 80 steps long and that person’s going to do that on 10 or 12 different power trains,” Gauthier said.

“It’s just the law of big numbers. You’re going to make a mistake in there somewhere. And so we spent a lot of time in the design of our data centers, thinking about how we minimize the steps you have to take in a maintenance situation.”

But the biggest cause of outages, Sloss said, was software bugs. “When people think about disasters in cloud computing they tend to fixate on the things that are dramatic like: ‘What if there was a fire? What if there was an explosion? What if there was a huge power outage? What if there was an earthquake?’ No, the actual problem is software bugs, by far and away.”

For this reason, the company has invested heavily in cold storage, as “resilience requires you to have offline storage that a bug can’t touch.” To avoid future errors, after an outage Google has “a no blame post-mortem culture, because if something goes wrong, people didn’t intend to break it, but they broke it anyway.

“Even with the most spectacular problems that we’ve had, the focus is never ‘who can we blame for this?’

“It’s about: ‘How can we fix our analysis, our processes, our system and so on, so it doesn’t happen in the future?’ That philosophy plays perfectly into what we’re doing in the public cloud now because in the public cloud you can’t control who’s using your systems. You can’t go and say ‘no don’t do it this way do it this other way.’”

Instead, “you have to have it so that the systems are supporting people taking best practices seriously, and actually make it easy for them to make good decisions.”

Together, these companies have had to change the way they operate, and, in doing so, have had to change how everyone else operates.

But as these goliaths focus on building and fixing systems at a scale never before seen, Microsoft’s Gauthier isn’t thinking solely about that. Instead, he’s reminded of his father: “My dad worked at NASA during the Space Race and I’m like: did he know what was going on at the time? Because I’m thinking, am I going to look back in 15 or 20 years and go ‘holy crap, how did we do that?’”

This article was the cover feature of the June/July issue of DCD magazine.