It was only a skip in time, but Facebook’s VP of site operations Tom Furlong says the leap second his data center operations had to accommodate back in the summer of 2012 taught Facebook valuable lessons about capacity, lessons that have had a lasting impact on its web-scale operations.
Furlong was expanding on a talk originally given at DatacenterDynamics (DCD) Converged in San Francisco earlier this year, speaking at the DCD Converged London event, which more than 2,500 people attended. In the US, Furlong documented the build of Facebook’s home-grown Data Center Infrastructure Management (DCIM) suite, which he says helps Facebook “knit together” information from the data center with data from servers.
In the UK, Furlong offered more insight into the outcomes of Facebook’s DCIM effort, saying the tools it has designed can now help Facebook overcome unusual outages.
The leap second triggered a bug in the Linux systems used by Facebook which drove CPU usage to 100% across Facebook’s operations, Furlong says. “The way it manifested for us was most of our machines throttled to about 100% CPU, which is rather fascinating and a slightly unexpected result. We saw 100% utilization across our entire footprint, and on every single machine in our Ashburn facility, which meant we lost a couple of dozen cabinets for exceeding breaker capacity,” Furlong says.
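The breaker trips follow from simple arithmetic: cabinet power is usually provisioned against typical mixed utilization, not against every server peaking simultaneously. The sketch below illustrates the idea with invented numbers (server counts, wattages and breaker ratings are assumptions, not Facebook’s figures), using a naive linear power model and the common practice of running breakers at 80% of their rating.

```python
# Hypothetical illustration of why simultaneous 100% CPU can trip cabinet
# breakers. All numbers are invented for the example.

SERVERS_PER_CABINET = 30
IDLE_WATTS = 120          # assumed per-server draw at low utilization
PEAK_WATTS = 300          # assumed per-server draw at 100% CPU
BREAKER_AMPS = 40
VOLTS = 208
DERATING = 0.8            # breakers are commonly run at 80% of rating

breaker_watts = BREAKER_AMPS * VOLTS * DERATING   # usable cabinet budget

def cabinet_draw(utilization: float) -> float:
    """Naive linear power model: draw scales between idle and peak with CPU load."""
    per_server = IDLE_WATTS + utilization * (PEAK_WATTS - IDLE_WATTS)
    return SERVERS_PER_CABINET * per_server

typical = cabinet_draw(0.5)   # normal mixed workload
leap = cabinet_draw(1.0)      # every server pinned at 100%

print(f"budget  : {breaker_watts:.0f} W")
print(f"typical : {typical:.0f} W ({'OK' if typical <= breaker_watts else 'TRIP'})")
print(f"leap sec: {leap:.0f} W ({'OK' if leap <= breaker_watts else 'TRIP'})")
```

With these assumed figures the cabinet fits comfortably under its budget at typical load but exceeds it when every server peaks at once, which is the failure mode Furlong describes.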
Furlong says that due to the nature of Facebook’s business (which is software supported by data centers for the popular social media site), resilience is often created by regional failover, or is absorbed by the site itself, which, unlike mission-critical business operations, can run at slower rates without too much business impact.
While a couple of dozen cabinets is not a large outage by Facebook’s standards, it did lead Furlong and his team to reconsider the amount of headroom they had been working with, though not in the way you might expect. Given the event’s unusual nature, Furlong actually asked whether Facebook had too much capacity.
“The leap second was such a bizarre event for us. To think about every machine going to 100% CPU utilization – that just doesn’t happen,” Furlong says. “It doesn’t happen in failover, so we started to look at how much capacity we had, looking at the behaviour of individual clusters to see how much power they had so we could match it with the server utilization they were showing.”
Capacity planning at Facebook
DCIM, part of this capacity project, has become central to Facebook’s efforts to reduce server sprawl, Furlong says. But to define capacity, Facebook first needed to ascertain how applications really affected workload.
Servers at Facebook have a varying workload, with peaks and troughs of use throughout each day, according to Furlong. “So we needed to look at the servers and see what behaviours they exhibited under the workload we got, and that’s complicated. Then redundancy could be added,” Furlong says. “We have a good idea of what the behaviour of the software is because we model it, then we try to determine the best server for it, but typically we get software out and then iterate on it significantly, which means that the software needs can change.”
Furlong says these calculations on server power consumption under differing workloads have also been key to energy efficiency gains at Facebook, which aims to run clusters at utilization in the high 90s.
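The matching Furlong describes – measured cluster power against provisioned power and CPU utilization – can be sketched as a simple headroom calculation. The class, field names and numbers below are illustrative assumptions, not Facebook’s tooling, and the projection uses a deliberately naive linear model (it ignores idle power, so it overstates peak draw).

```python
# Hedged sketch of matching a cluster's measured power draw against its
# provisioned budget and CPU utilization to estimate remaining headroom.
# Names and numbers are illustrative, not Facebook's actual tooling.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    servers: int
    power_budget_w: float      # provisioned power for the cluster
    measured_power_w: float    # current draw from the BMS/PDUs
    avg_cpu_util: float        # mean CPU utilization across servers

    def projected_peak_w(self) -> float:
        # Naive linear projection: draw if every server ran at 100% CPU.
        return self.measured_power_w / max(self.avg_cpu_util, 1e-9)

    def headroom_servers(self) -> int:
        """How many more servers fit if each adds the projected peak draw."""
        per_server_peak = self.projected_peak_w() / self.servers
        spare = self.power_budget_w - self.projected_peak_w()
        return max(0, int(spare // per_server_peak))

c = Cluster("web-01", servers=1000, power_budget_w=600_000,
            measured_power_w=300_000, avg_cpu_util=0.6)
print(c.projected_peak_w())    # worst-case draw if all servers peaked
print(c.headroom_servers())    # servers that could be added safely
```

Running the same calculation periodically, rather than once at install time, is what catches the drift Furlong describes next, where application changes quietly erode a cluster’s effective capacity.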
He says that previously, when Facebook installed clusters optimized for particular workloads, the teams had assumed they would maintain their optimal levels. “We would install and forget about it, saying it will be there for three years and in three years we will decommission it,” Furlong says. But this was not the case.
Facebook went back to revisit some older clusters and found them running at much lower utilization. “Some clusters we had in for 18 months we found were nowhere near being able to produce at the high 90% [CPU utilization they were optimized for]. We had lost about 10% of our theoretical capacity,” Furlong said. “This was because the application needs changed over time.”
Initially, Furlong and his teams tried to overcome the situation manually. “We had to rotate testing through our clusters and adjust them to make sure we got full utilization,” Furlong says.
Modelling through DCIM has helped automate this process, which Furlong says had become too complicated to handle manually because of the constant changes in workload.
“We would manually calculate what the clusters were going to look like, and our engineers would give us something to say what this cluster is going to be, but because change is constant, the cluster layout could change weekly, right up to the point at which we would deploy. This meant it would take us another two days to re-layout the cluster,” Furlong says.
“Before our Forest City data center in North Carolina was launched in 2012, we altered the clusters 13 times in the space of something like three months and that was when we said enough, there is too much rework on this.”
Facebook has now been able to cut the time it takes to deploy clusters down to a matter of hours by using modelling tools that gather information from its servers – CPU utilization, IOPS, other server metrics – and from its larger data center environment, including building management systems, power, cooling and capacity.
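One reason automated re-layout is fast enough to redo on every change is that, at its core, it is a constrained packing problem. The toy below (a greedy first-fit over a single constraint, with all names and figures invented) shows the shape of the computation; a real DCIM layout would also weigh cooling, network topology and failure domains.

```python
# A minimal sketch of automated cluster layout: greedily assigning cabinets
# to rows so no row exceeds its power budget. Purely illustrative; a cabinet
# that alone exceeds the budget is not handled here.

def layout(cabinet_draws_w, row_budget_w):
    """Greedy first-fit: start a new row whenever the next cabinet won't fit."""
    rows, current, used = [], [], 0.0
    for draw in cabinet_draws_w:
        if used + draw > row_budget_w and current:
            rows.append(current)
            current, used = [], 0.0
        current.append(draw)
        used += draw
    if current:
        rows.append(current)
    return rows

rows = layout([6000, 5500, 7000, 4000, 6500], row_budget_w=12000)
print(rows)  # each sub-list is one row whose total stays within 12 kW
```

Because the inputs (per-cabinet draw, row budgets) come straight from the metrics feed, a weekly change to the cluster definition just means rerunning the solver rather than two days of manual rework.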
“We can go back and redo these tasks very quickly now, so instead of doing laborious work we can do more interesting things with our time,” Furlong says. “We can take ten servers operating at 50% utilization, then take five of those and push them down to zero, so theoretically we are using less energy. We can also take the front end, which is running a virtual environment, dial those machines down and turn them up when we need capacity. We can also do power capping, putting more servers in a row and then stopping them from going right to the top when they need to, so we can better protect ourselves against a spike,” Furlong says.
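The two levers Furlong describes – consolidating load so some machines can idle, and capping per-server power so extra servers can share a row – can be sketched in a few lines. This is a toy model under assumed numbers, not Facebook’s scheduler.

```python
import math

def consolidate(utils, target=1.0):
    """Pack the total load onto as few servers as possible; the rest idle at 0.
    E.g. ten servers at 50% become five at 100% and five at zero."""
    total = sum(utils)
    active = math.ceil(total / target)
    packed = [min(target, total - i * target) for i in range(active)]
    idle = [0.0] * (len(utils) - active)
    return packed + idle

def cap_power(draws_w, cap_w):
    """Power capping: clamp each server's draw, shaving demand spikes so more
    servers can share a row without risking the breaker."""
    return [min(d, cap_w) for d in draws_w]

print(consolidate([0.5] * 10))       # five servers loaded, five idle
print(cap_power([300, 450, 500], 400))
```

Consolidation saves energy only to the extent that idle servers draw much less than loaded ones (or can be powered off); capping trades a little peak performance for a hard guarantee on row-level draw.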
“We operate with what we call contention-induced latency, which is contention in the resources when the machines are fighting to get the net result. When we are at high utilization the site slows down, so now that is really the only kind of failure that happens. When we cut off some of those top peaks, it allows us to safely put in more servers without fear of tripping breakers.”
This approach could have been used to overcome the leap second event: with capping in place, CPU utilization could simply have been levelled out across additional servers rather than tripping breakers.
Furlong says Facebook had no idea how valuable tools for cluster planning could be until it started to implement the practice. “We now know where all our machines are, and we have a real feel for where the racks are. Before that there wasn’t a real source of truth, but this is an evolutionary thing,” Furlong says.
He says Facebook is still finding new and interesting ways to use DCIM, more of which will be revealed at the next Open Compute summit, where DCIM will be a key focus, in San Jose on June 29 next year.
You can also read more on how Facebook built its DCIM suite in coverage from the San Francisco event here and see a video from DCD Converged in London, where Tom Furlong looks at the future for web-scale data center operations at Facebook here.