Cloudflare experienced a significant outage after a technician disconnected multiple redundant fiber connections from one of its two core data centers.
The incident happened during planned maintenance of the facility, with the company blaming its own poor instructions and lack of cable labeling rather than the technicians themselves.
"As part of planned maintenance at one of our core data centers, we instructed technicians to remove all the equipment in one of our cabinets," the web-infrastructure and website-security company's CTO John Graham-Cumming said in a blog post.
"That cabinet contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet. The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel."
As a result, the Cloudflare Dashboard and API were unavailable for four hours and 21 minutes, from 15:31 UTC until 19:52 UTC, while the Cloudflare network continued to operate normally.
Restoring services took longer than necessary, Graham-Cumming admitted, because of the time it took to identify the cables that provide external connectivity. "We should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem."
In future, the company plans to improve labeling and to tell technicians clearly which cabling should not be touched.
Graham-Cumming added: "While the external connectivity used diverse providers and led to diverse data centers, we had all the connections going through only one patch panel, creating a single physical point of failure. This should be spread out across multiple parts of our facility."
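The failure mode Graham-Cumming describes can be illustrated with a minimal sketch in Python. It assumes a simple inventory that records, for each external link, its provider, its destination, and the patch panel it terminates on. The `Link` type, the `single_points_of_failure` function, and all provider, data center, and panel names are hypothetical and do not reflect Cloudflare's actual topology or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    provider: str      # carrier supplying the fiber (hypothetical name)
    destination: str   # remote data center the link reaches
    patch_panel: str   # physical panel the link terminates on


def single_points_of_failure(links):
    """Return the patch panels whose removal would sever every link."""
    by_panel = defaultdict(list)
    for link in links:
        by_panel[link.patch_panel].append(link)
    # A panel is a single point of failure if it carries all the links.
    return sorted(panel for panel, carried in by_panel.items()
                  if len(carried) == len(links))


# Illustrative inventories only, not Cloudflare's real layout.
as_described = [
    Link("provider-a", "dc-east", "panel-1"),
    Link("provider-b", "dc-west", "panel-1"),
    Link("provider-c", "dc-north", "panel-1"),
]
spread_out = [
    Link("provider-a", "dc-east", "panel-1"),
    Link("provider-b", "dc-west", "panel-2"),
    Link("provider-c", "dc-north", "panel-3"),
]

print(single_points_of_failure(as_described))
print(single_points_of_failure(spread_out))
```

In the first inventory, diverse providers and destinations still share one panel, so that panel is flagged. In the second, no single panel carries every link, which is the kind of spread the company says it should aim for.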
The issue may also have been exacerbated because the 'war room' teams working to restore services had to do so remotely, from their homes, due to Covid-19.
Last year, Cloudflare experienced two major outages a week apart, for unrelated reasons: the first was caused by a BGP error, the second by an issue with its own DDoS protection system.
The company has been unusually open about its outages, including one of its first, a hack back in 2012. "That was incredibly painful," CEO Matthew Prince told DCD.
"It affected one of our customers, and it impacted me personally, because the hacker had actually hacked into my personal email in order to get in.
"And I was embarrassed - frankly, I didn't want to share the details of everything that happened. Our team said 'No, that's not our culture, and that's not what we stand for and we really believe in this idea of being radically transparent with whatever happened.' I was afraid that we would lose customers. It turned out instead... that transparency actually helped people build trust."