Content Delivery Network (CDN) company Cloudflare suffered a second data center power failure at its core facility in Portland, Oregon, within six months.
However, the company said engineering work in the interim months meant the impact of the second outage was minimal compared to the first.
The first failure
In November, the company suffered a power failure at the PDX01 data center in Portland. A number of services – including Cloudflare’s main control plane – were brought offline.
A subsequent report from the company said one of two Portland General Electric (PGE) power feeds into the facility failed after unplanned maintenance. UPS backup power didn’t last as long as planned, and Flexential wasn’t able to get the site’s generators started in time.
There were several other operational and component issues that concurrently impacted service recovery – including a lack of automated failover processes on Cloudflare’s side.
A second failure
In a blog post this week, Cloudflare detailed the second incident and the automated failover work it had completed to mitigate the failure of one of its primary locations – the Flexential data center in Portland known as PDX01.
“We didn’t expect such an extensive real-world test so quickly, but the universe works in mysterious ways. On Tuesday, March 26, 2024 – just shy of five months after the initial incident – the same facility had another major power outage,” the company said.
As a result of the first incident, Cloudflare introduced ‘Code Orange’, a new process under which the company shifts most or all engineering resources to the issue at hand when there is a significant event or crisis.
“Over the past five months, we shifted all non-critical engineering functions to focusing on ensuring high reliability of our control plane,” the company said. “We had spent months preparing, updating countless systems, and activating huge amounts of network and server capacity, culminating with a test to prove the work was having the intended effect, which in this case was an automatic failover to the redundant facilities.”
In March, the company suffered another power failure at the same facility. However, thanks to several months of failover work, the impact on its services was minimal compared to the November event.
“On March 26, 2024, at 14:58 UTC, PDX01 experienced a total loss of power to Cloudflare’s physical infrastructure following a reportedly simultaneous failure of four Flexential-owned and operated switchboards serving all of Cloudflare’s cages,” Cloudflare said of the outage. “This meant both primary and redundant power paths were deactivated across the entire environment.”
“Initial assessment of the root cause of Flexential’s Circuit Switch Boards (CSB) failures points to incorrectly set breaker coordination settings within the four CSBs as one contributing factor,” the company added.
In Cloudflare’s case, Flexential’s breaker settings within its four CSBs were too low relative to the downstream provisioned power capacities. When one or more of these breakers tripped, the remaining active CSBs failed in a cascade, causing a total loss of power to Cloudflare’s cages and others on the shared infrastructure.
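Flexential has not published its electrical design, but the failure mode described above can be illustrated with a simple selectivity check: if a switchboard’s breaker trips below the load it is provisioned to carry, the load it drops shifts onto its neighbours, which then trip in turn. The sketch below is illustrative only – the board names, trip settings, and provisioned loads are hypothetical and are not taken from the Flexential facility.

```python
# Illustrative only: hypothetical breaker trip settings versus downstream
# provisioned load, loosely modelled on the failure mode described above.
from dataclasses import dataclass

@dataclass
class Switchboard:
    name: str
    trip_setting_kw: float       # breaker trip threshold (hypothetical value)
    provisioned_load_kw: float   # downstream capacity provisioned to customers (hypothetical)

boards = [
    Switchboard("CSB-1", trip_setting_kw=400, provisioned_load_kw=500),
    Switchboard("CSB-2", trip_setting_kw=400, provisioned_load_kw=500),
    Switchboard("CSB-3", trip_setting_kw=400, provisioned_load_kw=500),
    Switchboard("CSB-4", trip_setting_kw=400, provisioned_load_kw=500),
]

def check_coordination(boards: list[Switchboard]) -> None:
    """Flag any board whose trip setting sits below its provisioned load.

    A board in this state can trip under a load it is meant to carry; if the
    surviving boards then pick up the orphaned load, they exceed their own
    (equally low) settings and trip in turn, cascading to a total loss of power.
    """
    for b in boards:
        margin = b.trip_setting_kw - b.provisioned_load_kw
        status = "OK" if margin >= 0 else f"under-set by {-margin:g} kW"
        print(f"{b.name}: trip {b.trip_setting_kw:g} kW, "
              f"provisioned {b.provisioned_load_kw:g} kW -> {status}")

check_coordination(boards)
```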
“During the triage of the incident, we were told that the Flexential facilities team noticed the incorrect trip settings, reset the CSBs, and adjusted them to the expected values, enabling our team to power up our servers in a staged and controlled fashion.”
Cloudflare said it did not know when these CSB settings were established – but noted that they would typically be set or adjusted as part of a data center commissioning process or a breaker coordination study, before customer critical loads are installed.
Flexential has not publicly commented on the second outage. DCD has reached out to the company for comment.
Cloudflare learns its lessons
The company said that during the November incident, some services saw at least six hours of control plane downtime, and several were functionally degraded for days. During the March incident, by contrast, its services were up and running within minutes of the power failure, and many saw no impact at all during the failover.
“On March 26, 2024, at 14:58 UTC, PDX01 lost power and our systems began to react. By 15:05 UTC, our APIs and Dashboards were operating normally, all without human intervention. There were a few specific services that required human intervention and therefore took a bit longer to recover, however, the primary interface mechanism was operating as expected.”
The company said its Control Plane, which consists of hundreds of internal services, is designed so that when it loses one of the three critical data centers in Portland, these services continue to operate normally in the remaining two facilities. The company also has the capability to fail over to its European data centers if all three Portland centers are completely unavailable.
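Cloudflare has not published the internals of that failover logic, but the behaviour it describes – serve the control plane from whichever Portland facilities remain healthy, and fall back to Europe only if all three are lost – can be sketched roughly as follows. The site names and the health check are assumptions for illustration, not Cloudflare’s actual configuration.

```python
# Rough sketch of the facility-selection behaviour described above.
# The site names and the health check are hypothetical placeholders.

PORTLAND_SITES = ["pdx-a", "pdx-b", "pdx-c"]   # three core Portland facilities (names assumed)
EUROPE_SITES = ["eu-a", "eu-b"]                # European fallback facilities (names assumed)

def is_healthy(site: str) -> bool:
    # Placeholder: a real check would probe power, network reachability,
    # and service-level signals for the site.
    return True

def active_facilities() -> list[str]:
    """Prefer whichever Portland sites survive; use Europe only if all three are lost."""
    surviving = [s for s in PORTLAND_SITES if is_healthy(s)]
    if surviving:        # losing one Portland facility leaves the other two serving normally
        return surviving
    return [s for s in EUROPE_SITES if is_healthy(s)]
```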
“Though the March 26, 2024, incident was unexpected, it was something we’d been preparing for,” the company said. “More than 100 databases across over 20 different database clusters simultaneously failed out of the affected facility and restored service automatically. This was actually the culmination of over a year’s worth of work, and we make sure we prove our ability to failover properly with weekly tests.”
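The company does not describe how those weekly tests are run. One plausible shape for such a drill – shown below purely as a sketch, with a hypothetical cluster interface rather than Cloudflare’s tooling – is to fail every cluster out of a chosen facility and verify that a primary is promoted elsewhere within a time budget.

```python
# Hypothetical weekly failover drill for replicated database clusters.
# The Cluster interface is an assumption for illustration, not Cloudflare's tooling.
from typing import Optional, Protocol

class Cluster(Protocol):
    name: str

    def demote_primaries_in(self, site: str) -> None:
        """Force any primaries hosted in `site` to step down."""

    def wait_for_primary(self, timeout_s: int) -> Optional[str]:
        """Return the site now hosting a writable primary, or None on timeout."""

def run_failover_drill(clusters: list[Cluster], drained_site: str) -> None:
    """Fail every cluster out of `drained_site` and verify a primary is promoted elsewhere."""
    failed = []
    for cluster in clusters:
        cluster.demote_primaries_in(drained_site)            # push primaries off the drained facility
        primary_site = cluster.wait_for_primary(timeout_s=60)
        if primary_site is None or primary_site == drained_site:
            failed.append(cluster.name)                      # no healthy primary outside the drained site
    if failed:
        raise RuntimeError(f"Failover drill failed for clusters: {failed}")
```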
The cold start of the PDX01 site took roughly 72 hours in November 2023, but this time around, the facility was restarted in roughly 10 hours. The company aims to reduce the cold restart time in future.
Cloudflare noted, however, that its Analytics platform was impacted and wasn’t fully restored until later that day. The company said this was expected behavior as the Analytics platform is reliant on the PDX01 data center.
“Just like the Control Plane work, we began building new analytics capacity immediately after the November 2, 2023, incident,” Cloudflare said. “However, the scale of the work requires that it will take a bit more time to complete. We have been working as fast as we can to remove this dependency, and we expect to complete this work in the near future.”