Cloudflare has blamed a November 2 outage on a power failure at a Flexential data center in Hillsboro, Oregon.
The Content Delivery Network (CDN), security, and Edge computing company alleged a number of failures that led to its services going offline.
In a lengthy post-mortem, CEO Matthew Prince explained that the PDX-04 facility was the primary data center for Cloudflare's control plane and analytics systems, with two other data centers near Hillsboro also supporting those services.
Cloudflare consumes about 10 percent of PDX-04's total capacity.
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04," Prince said.
Prince claims that Flexential powered up its generators to run alongside the building's remaining power feed, but said that the company did not inform Cloudflare.
"Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded."
Prince added that it was unusual for Flexential not to switch fully to generator power, and speculated that Flexential was part of PGE's Dispatchable Standby Generation (DSG) program and was using its generators to help supply additional power to the grid.
Flexential's COO, Ryan Mallory, told DCD: "There are different scenarios for data center providers to work with the utility to provide capacity back to the grid. PGE had the issue several years ago with wildfires, where the powerlines came down in the Hillsboro area. So there are these types of scenarios that are in place to support the power utility if there is an issue with the grid."
However, when asked directly if the DSG scheme was in use prior to the outage, he said: "I'm not prepared to address that, it's still being worked with PGE right now."
Whatever the reason for running generators alongside the remaining feed, a ground fault then occurred on a PGE transformer at PDX-04. "It seems likely, though we have not been able to confirm with Flexential or PGE, that the ground fault was caused by the unplanned maintenance PGE was performing that impacted the first feed," Prince said.
The resulting protective measures took the generators and the remaining feed offline, leaving the facility without power.
A bank of UPS batteries was meant to last 10 minutes, giving Flexential enough time to get the generators running, but "the batteries started to fail after only four minutes," Prince claimed. "And it took Flexential far longer than 10 minutes to get the generators restored."
He added: "While we haven't gotten official confirmation, we have been told by employees that three things hampered getting the generators back online. First, they needed to be physically accessed and manually restarted because of the way the ground fault had tripped circuits.
"Second, Flexential's access control system was not powered by the battery backups, so it was offline. And third, the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week."
Cloudflare claims that it was not informed of any of the issues, and first found out something was wrong when its routers began to fail.
When the generators were restored, the circuit breakers connecting Cloudflare's equipment were discovered to be faulty.
"We don't know if the breakers failed due to the ground fault or some other surge as a result of the incident, or if they'd been bad before, and it was only discovered after they had been powered off," Prince said.
Replacing them took time, as Flexential did not have enough replacements on hand.
"Candidly you know, I'm not aware of what they're talking about," Flexential's Mallory told DCD. "When there is an issue at the facility, their standard operating procedure is to go through a cascading evaluation of the facility.
"And so, having somebody call out breakers as being an issue is something that - I don't know where that information came from. But we have a standard operating procedure for people to our engineering staff at the site to walk through and check every tier of electrical systems. And so we made sure that all of the systems were operating, and that customers were able to come back online as soon as they were able."
Mallory was unable to answer most of DCD's questions, noting that Flexential's own root cause analysis (RCA) was still underway. The results of that RCA will be shared with customers, but not with the public. DCD will update this story if we learn more following the RCA.
"We have a number of questions that we need answered from Flexential," Prince said. "But we also must expect that entire data centers may fail."
Prince also admitted that Cloudflare made its own mistakes, with two critical services running only in PDX-04. While the company had tested the other two facilities going offline, and parts of PDX-04 going down, it had never tested the entire PDX-04 facility going offline.
"We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster," Prince said.
Going forward, the company plans to "remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network."