Roblox, operator of the eponymous video game platform, said that it had expanded its data center infrastructure in the wake of a devastating 73-hour outage.
The company's service went down in late October, with Roblox detailing what went wrong in a lengthy blog post this month.
Prior to the outage, Roblox said that it operated over 18,000 servers and 170,000 containers. "In order to run thousands of servers across multiple sites, we leverage a technology suite commonly known as the “HashiStack.” Nomad, Consul and Vault are the technologies that we use to manage servers and services around the world, and that allow us to orchestrate containers that support Roblox services," CTO Daniel Sturman said.
"In the months leading up to the October incident, Roblox upgraded from Consul 1.9 to Consul 1.10 to take advantage of a new streaming feature. This streaming feature is designed to significantly reduce the CPU and network bandwidth needed to distribute updates across large-scale clusters like the one at Roblox."
When Roblox services communicate, they rely on Consul for up-to-date knowledge of the location of the service they want to talk to. That dependency became a problem over the course of October 28, as Consul began to deteriorate, eventually knocking the whole platform offline.
After tens of hours of trial and error, the company realized that something was going wrong with the new streaming feature in Consul. "Why was streaming an issue?" Sturman said.
"HashiCorp explained that, while streaming was overall more efficient, it used fewer concurrency control elements (Go channels) in its implementation than long polling. Under very high load – specifically, both a very high read load and a very high write load – the design of streaming exacerbates the amount of contention on a single Go channel, which causes blocking during writes, making it significantly less efficient. This behavior also explained the effect of higher core-count servers: those servers were dual socket architectures with a NUMA memory model. The additional contention on shared resources thus got worse under this architecture. By turning off streaming, we dramatically improved the health of the Consul cluster."
Several other issues were then tracked down and systematically fixed overnight. The service was brought back online in stages, with full recovery on October 31.
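The contention HashiCorp described, many writers blocking on a single Go channel, can be illustrated with a small, self-contained sketch. This is not Consul's streaming code: the `fanIn` function, the writer and shard counts, and the use of unbuffered channels are all assumptions chosen for illustration. It simply runs the same write load through one channel and then through several, so the effect of funneling every write into a single channel can be timed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fanIn is a hypothetical helper, not part of Consul. It starts `writers`
// goroutines that each send `perWriter` updates, spread across `shards`
// channels with one reader per channel. shards == 1 models every writer
// contending on a single channel. It returns the number of updates received.
func fanIn(writers, perWriter, shards int) int {
	chans := make([]chan int, shards)
	for i := range chans {
		chans[i] = make(chan int) // unbuffered: a send blocks until the reader is ready
	}

	// Each reader owns exactly one slot in counts, so no lock is needed.
	counts := make([]int, shards)
	var readers sync.WaitGroup
	for i := range chans {
		readers.Add(1)
		go func(i int) {
			defer readers.Done()
			for range chans[i] {
				counts[i]++
			}
		}(i)
	}

	var senders sync.WaitGroup
	for w := 0; w < writers; w++ {
		senders.Add(1)
		go func(w int) {
			defer senders.Done()
			ch := chans[w%shards] // assign each writer to one shard
			for n := 0; n < perWriter; n++ {
				ch <- n
			}
		}(w)
	}

	// Close the channels only after every writer has finished sending.
	senders.Wait()
	for _, ch := range chans {
		close(ch)
	}
	readers.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	// Same write load, first through one channel, then through eight.
	for _, shards := range []int{1, 8} {
		start := time.Now()
		got := fanIn(64, 10000, shards)
		fmt.Printf("shards=%d received=%d elapsed=%v\n", shards, got, time.Since(start))
	}
}
```

Timings depend on the machine, core count, and Go version, so the sketch prints them rather than promising a particular gap. Every update is delivered in both configurations; only the time writers spend blocked on sends differs.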
To avoid a similar outage, Roblox plans to overhaul much of its back-end software. It will also expand Consul beyond a single cluster into an "additional, geographically distinct data center," and plans to use different availability zones within its data centers for added redundancy. The company said it had upped its hiring plans to support the data center expansion.
The outage - which cost the company around $25 million in lost bookings - has not changed Roblox's view that it should operate its own infrastructure.
"For our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem," Sturman said. "By building and managing our own data centers for backend and network Edge services, we have been able to significantly control costs compared to public cloud. These savings directly influence the amount we are able to pay to creators on the platform. Furthermore, owning our own hardware and building our own Edge infrastructure allows us to minimize performance variations and carefully manage the latency of our players around the world. Consistent performance and low latency are critical to the experience of our players, who are not necessarily located near the data centers of public cloud providers."
However, Sturman added that the company was "not ideologically wedded" to the approach, and used public cloud for burst capacity, DevOps workflows, and most of its in-house analytics.