Cloudflare has shared some insight into how it keeps its servers operating.
In a blog post published by the company this week, Cloudflare said that it approaches server maintenance through an "error budget," among other techniques including autonomous hardware diagnostics.
Cloudflare, a provider of Edge computing, security, and content delivery networks (CDNs) with servers located in 310 cities and 120 countries, said it has developed a fault-tolerant infrastructure that can continue operating with "little to no impact" from failures.
Previously, the company had to send a member of its data center operations team to manually diagnose and troubleshoot every server fault and return the server to operation, something which could take hours for a single server.
The new system, called 'Phoenix,' is autonomous, meaning that it can operate independently without human intervention or oversight.
According to the post, Phoenix runs autonomous diagnostics and recovery automation at regular intervals to detect servers that are broken. The system then figures out what is wrong with each server and recovers those that pass diagnostics by "re-provisioning, and ultimately re-enabling those that have successfully been re-provisioned in the safest and most unobtrusive way possible."
The system can understand the cause of failure and revert the state of the server accordingly.
Phoenix runs every 30 minutes on a maximum of two data centers at a time, meaning the entire Cloudflare fleet is covered in three days. On each run, it also notes servers that are already queued for recovery and makes sure the issue is solved immediately.
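The three-day figure can be roughly checked with some quick arithmetic. The sketch below is illustrative only: it assumes one data center per city (310 in total), which the post does not state, and the function name and parameters are hypothetical.

```python
import math


def full_sweep_days(data_centers: int, per_run: int = 2, interval_minutes: int = 30) -> float:
    """Estimate how many days one pass over every data center takes.

    Assumes each run covers `per_run` data centers and that runs start
    every `interval_minutes` with no gaps.
    """
    runs_needed = math.ceil(data_centers / per_run)
    return runs_needed * interval_minutes / (24 * 60)


# Assumption: one data center per city, so 310 data centers in total.
print(round(full_sweep_days(310), 2))
```

Under those assumptions, a full pass takes 155 runs, or a little over three days, which is consistent with the figure quoted above.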
If servers cannot be fully recovered, Phoenix will assess them and may return them to the repair state for additional evaluation. If they need a physical component replaced, the data center operations team is notified.
Cloudflare has also taught Phoenix that if other automation is executing operations such as expansions, it should only run checks when it is safe to do so, so that the recovery operation does not interfere with other work at the data center.
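One simple way to picture this kind of safety gate is a check that skips any data center where other automation is still active. The following is a minimal sketch, not Cloudflare's implementation; the `DataCenter` class, the `active_operations` field, and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    name: str
    # Hypothetical: names of other automation currently running here,
    # for example "expansion".
    active_operations: set[str] = field(default_factory=set)


def safe_to_check(dc: DataCenter) -> bool:
    """A data center is safe to check only when no other automation is running."""
    return not dc.active_operations


def pick_targets(candidates: list[DataCenter], limit: int = 2) -> list[DataCenter]:
    """Choose up to `limit` data centers for this run, skipping busy ones."""
    return [dc for dc in candidates if safe_to_check(dc)][:limit]


sites = [
    DataCenter("dc-a", {"expansion"}),
    DataCenter("dc-b"),
    DataCenter("dc-c"),
    DataCenter("dc-d"),
]
print([dc.name for dc in pick_targets(sites)])
```

Here the busy site is passed over rather than waited on, so a long-running expansion does not hold up checks elsewhere.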
Fault tolerance is also built in. "This means it’s able to gracefully deal with misbehaving servers by letting these quickly drop out of the recovery candidate list upon misbehavior that prevents blocking the operation," the post says.
Cloudflare further acknowledged that "not every broken server can be re-enabled and successfully returned to production, and more importantly, there's no 100 percent guarantee that a recovered server will be as stable as the ones with no repair history." The solution to this is what the company calls an 'Error Budget,' under which the Phoenix system stops recoveries without any human intervention if a server fails a certain number of times in a given window.
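The general idea of an error budget can be sketched as a per-server counter of failures inside a sliding time window. This is an illustration of the concept only: the post does not give Cloudflare's thresholds or implementation, and the class name, method names, and the values used below are assumptions.

```python
from collections import deque
import time


class ErrorBudget:
    """Track one server's failures within a sliding time window."""

    def __init__(self, max_failures: int, window_seconds: float):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # failure timestamps, oldest first

    def record_failure(self, now=None):
        """Note a failure at time `now` (defaults to the current time)."""
        self._failures.append(time.time() if now is None else now)

    def exhausted(self, now=None) -> bool:
        """Return True if automated recovery should stop for this server."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures


# Hypothetical values: three failures within 60 seconds exhausts the budget.
budget = ErrorBudget(max_failures=3, window_seconds=60)
for t in (0, 10, 20):
    budget.record_failure(now=t)
print(budget.exhausted(now=25))
print(budget.exhausted(now=75))
```

Once the budget is exhausted, the server would be left out of further automated recovery attempts; as older failures age out of the window, the budget frees up again.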
Earlier this year, Cloudflare revealed that it had extended its server hardware lifespan to five years, thus saving around $20m. Similar moves have been made by hyperscalers including Amazon, Google, Microsoft, and Meta.
Throughout 2023, Cloudflare successfully deployed GPUs in 120 cities for its Edge network. By the end of this year, the company plans to have deployed the accelerators in "nearly every city" that makes up its global network.