When fire engulfed OVHcloud's SBG2 data center in Strasbourg this week, the whole site was shut down, and the service provider's founder Octave Klaba tweeted: "We recommend to activate your Disaster Recovery Plan."
Just over a day later, some OVHcloud customers have lost data permanently, and some websites are still offline (including the prestigious Centre Pompidou in Paris). Most people are expressing sympathy for OVHcloud, and relief that no one was hurt in a pretty apocalyptic conflagration. But others have been calling for compensation - and somewhat ironically, these include gamers who get their kicks struggling for survival in the dystopian hell-world of the Rust game.
To activate a plan, you must have one
Klaba's words are the voice of reason here - and actually could be a timely reminder of things that some people might forget. When you do anything, you should be aware of the risks.
Data centers are so reliable that customers have come to expect them to always be there. Our WiFi and broadband can wobble, and e-commerce sites can fail to take our orders or miss deliveries, but people expect Google to always have their mail, Facebook to have their pictures ready at a single click, and chess servers to keep their games safe.
Those in the industry know better - or at least they should. The very existence of uninterruptible power supplies (UPSs) and redundant feeds is a sign that we know things can go wrong, and fire prevention systems are there because fires can happen. Across the industry, we may be very close to 100 percent reliability, but 100 percent reliability is an ideal of perfection which we can only approach asymptotically.
A disaster like this should not happen. When the debris is fully sifted, we will find out what caused it, and it will sadly be something which could have been avoided. However, it's a fact that complex human-technical systems will always have a failure rate. Things like this will inevitably happen from time to time, or to put it simply: "Accidents happen."
It's clear OVHcloud is pulling out the stops to fix everything that can be fixed - that's what we would expect from any service provider. But everyone should have disaster recovery plans.
When you sign up to a service provider, they will tell you (or at least they should) that they provide a best-efforts service. Their statistics are great, and they can offer services with additional reliability or improved support, but they can't guarantee nothing will go wrong. Some level of backup and disaster plans will be your responsibility.
The trouble is, a disaster plan needs to consider all the risks, and take appropriate action according to their probability. It's not always clear what those risks are.
A lot of the people most seriously affected ran their own dedicated "bare metal" servers at the OVH data center, instead of virtual servers in a cloud. That's a decision they made, which gave them access to more performance on dedicated hardware, and perhaps a perception of greater privacy. However, while OVHcloud can keep backups of the virtual machines in its cloud, users with bare metal servers don't get that service.
Understand the risks
"What seems to be lost is customers who had VPS [virtual private server] or dedicated server without backups," tweeted Swiss entrepreneur Kalle Sintonen after the fire. "OVH data is always saved in an other location as well.."
The Twitter thread is educational. It takes Sintonen two goes to explain it: "VPS and dedicated servers are managed by the customer, not OVH. So it's the customer's failure management in place"
Customers with bare-metal instances shouldn't keep the family jewels on those servers. If they have something there that needs backing up, they should make sure to back it up. And when they decide how to back the data up, they should understand what risks they are protecting against.
Some OVHcloud customers will have only considered hard drive crashes or memory failures, and backed up the data to another server... in the same building.
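The same-building trap can be made concrete with a short check. What follows is only an illustrative sketch, not anything OVHcloud provides: the `Copy` record, the two helper functions, and the "other-site" label are all hypothetical names invented for this example. It lists every site whose loss would take out all copies of your data at once.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Copy:
    """One copy of the data (hypothetical record for illustration)."""
    name: str  # label for the server or backup target
    site: str  # physical site (building) that holds this copy


def survives_site_loss(copies: Sequence[Copy], lost_site: str) -> bool:
    """True if at least one copy is held outside the lost site."""
    return any(c.site != lost_site for c in copies)


def single_points_of_failure(copies: Sequence[Copy]) -> List[str]:
    """Sites whose total loss would destroy every copy of the data."""
    sites = {c.site for c in copies}
    return sorted(s for s in sites if not survives_site_loss(copies, s))


# Primary and backup in the same building: one fire takes both.
same_building = [Copy("primary", "SBG2"), Copy("backup", "SBG2")]

# Backup held somewhere else ("other-site" is a placeholder name).
offsite = [Copy("primary", "SBG2"), Copy("backup", "other-site")]

print(single_points_of_failure(same_building))  # ['SBG2']
print(single_points_of_failure(offsite))        # []
```

A backup in the same building passes a "do I have a second copy?" test but fails this one, which is the distinction a disaster plan needs to draw.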
It's easy to be wise after the event, and some people will have to come to terms with the fact that they decided - perhaps unwittingly or unconsciously - that their data and their sites only deserved a certain level of reliability.