It’s more than two months since fire destroyed one of OVHcloud’s data centers in Strasbourg, damaging a second one so severely that it will not be reopened. The fire also knocked OVH’s other two facilities on the Rhine-side industrial park offline for an extended period.

The cloud operator has been working extra hard to get users’ servers back online again - a job made doubly hard by the fact that many had not taken out optional backup services, or worse still, didn’t realise they needed to. It's been harder to get information about what caused the fire because, OVHcloud says, the authorities are involved.

Can we talk about this?

The problem is that fires in data centers are quite rare, and so are destructive outages. Information about those fires is rarer still.

A survey by the Data Center Incident Reporting Network (DCIRN) found only 31 reports of data center fires in the last 18 years. “Because of the industry’s secrecy there is no doubt that there were many more data center fire incidents that did not make the news,” commented DCIRN’s CEO Americas, Dennis Cronin.

To put that in context, 31 reports in 18 years works out to fewer than two reports of data center fires per year. That's fewer than the number of outages we've reported this year due to rodents - and roughly half the number of animal-related outages, given that this year also saw a beaver cause an outage in Canada.

Cronin found that not all of the reports even involved actual fires. Sixty-five percent were real fires, but the rest were either misreported or were unfortunate incidents where something else set off the fire suppression system, which can itself damage hard drives and disrupt services.

The incidents were disruptive, however. Downtime averaged 17.5 hours, which sounds quite tame compared with the OVHcloud incident, where it took weeks to restore some services, and some customers have lost data permanently.

So, given that the Strasbourg fire is at the extreme end of the spectrum, it’s instructive to see how OVHcloud has been handling the issue. Even with the best resiliency features, there’s always a non-zero chance that something like this could happen in any data center - though most operators would expect their procedures to keep that risk to an infinitesimal level.

OVHcloud has been making a series of announcements about free backups for the future, and improved resilience in its infrastructure. Some of these announcements include the provision of offsite backups, while others seem to be about offering resiliency services similar to the regions and availability zones already provided by cloud operators such as AWS and Azure.

Evaluating the operator’s response to the actual fire, however, is a frustrating exercise, because there is still too little information to gauge how well it performed. As we said, OVHcloud has not yet been able to reveal definitively what the cause of the fire was.

Initial comments by founder Octave Klaba suggested that a UPS caught fire the day after it was given a routine service. Comments in the French press have focused on supposed weaknesses in OVHcloud’s building architecture and fire prevention.

But there’s no possibility of evaluating any of this yet, because the facts have not been made public. OVHcloud says the incident report will be held back for months, because of the involvement of the French authorities, along with insurance companies and others.

That’s unfortunate. We’ve seen in the past that some data center incidents took a long time to understand or report, and this delayed the wider industry’s chance to learn from those events.

For instance, a power surge and UPS issue took down the Singapore Stock Exchange in 2017: the report came out six months later in 2018. When BA had an outage over a bank holiday, which grounded 672 flights in 2017, it took two years and a lawsuit to settle where the blame lay.

If one of BA’s planes had crashed, the cause would be found by a legally agreed process involving a black box data recorder. If a data center crashes, users can be left in the dark, while interested parties argue behind closed doors.

It’s bad enough to have a destructive fire. If the details of that fire remain obscured by clouds of smoke long after the event, that is in no one’s interest.