With the recent total loss of two data centers at a single site due to a fire at a cloud provider in France, the question arises: how frequent, and how damaging, are data center fires?
To answer this question, we started searching the Internet to see how many published incidents of data center fires there were and what insights we could glean from those reports.
This opinion piece was featured in the latest issue of DCD Magazine.
Sharing versus secrecy
All in all, from June 2003 to March 2021 (about 18 years), we were able to identify 31 data center fire reports. The challenge was that details as to the causes, durations and impacts were difficult to come by. It seems that these incidents only make the news when there is a fire or a report of a fire. If the causes are not readily known, follow-ups with more specific details seldom occur, or the details are buried behind non-disclosure agreements (NDAs) that prohibit anyone from speaking out.
One would expect that, with the loss risk being so large, insurance companies would demand that data center operators take greater precautions and share experiences to teach better risk mitigation. Yet data center operators are inherently opposed to sharing information, even when it could prevent future incidents. This culture of not sharing information on common issues made our research more difficult and more limited.
Here are a few statistics from what we found:
- 65 percent, or 20 of the 31, were confirmed as real fires
- 23 percent, or 7 of the 31, were reported as fires but had no confirming data
- 6 percent, or 2 of the 31, were at new data centers under construction
- 3 percent, or 1 of the 31, was construction dust setting off a fire suppression system in a live data center
- 3 percent, or 1 of the 31, was an interruption due to a fire suppression system test
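The shares in this list can be rechecked from the raw counts. The following is a minimal Python sketch, not part of the original research; the category labels are shorthand chosen here for illustration, while the counts are the ones given in the list.

```python
# Incident counts by category, as given in the list (31 reports in total).
# The labels are shorthand introduced for this sketch.
counts = {
    "confirmed fire": 20,
    "reported, unconfirmed": 7,
    "under construction": 2,
    "construction dust tripped suppression": 1,
    "suppression system test": 1,
}

total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} of {total} ({shares[name]} percent)")
```

The five categories account for all 31 reports, and the rounded shares match the percentages quoted.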
Because of the industry’s secrecy, there is no doubt that many more data center fire incidents did not make the news. Even so, we were able to identify an average of 1.5 major fires per year. Even more telling is the severity of a typical fire: excluding the recent incident in which two data centers were lost, downtime due to fire incidents averaged 17.5 hours. And that was the time taken to get the facility operational - exclusive of the IT reboots that followed.
So, 1.5 fires a year among the thousands of data centers around the world does not seem like much, unless of course one of them is yours. If it is, are you prepared for 17.5 hours of downtime before you can reboot servers? Don’t forget that the 17.5 hours assumes the servers are not damaged by heat, soot, water, or fire. Then there are other factors, like the data loss caused by the hard crash when the power suddenly goes away.
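The 1.5-per-year figure can be sanity-checked against the reported counts. The sketch below is a rough Python check under two loudly stated assumptions: the observation window runs from the first day of June 2003 to the first day of March 2021 (only the months are given), and the numerator is unknown, so three candidate groupings are tried.

```python
from datetime import date

# Observation window: June 2003 to March 2021.
# The exact days are not given; the first of each month is an assumption.
start = date(2003, 6, 1)
end = date(2021, 3, 1)
years = (end - start).days / 365.25

# Candidate numerators. Which incidents count as "major" is not stated,
# so these groupings are assumptions made for this sketch.
candidates = {
    "all reports": 31,
    "confirmed + unconfirmed fires": 27,
    "confirmed fires only": 20,
}

rates = {name: n / years for name, n in candidates.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per year over {years:.2f} years")
```

Under these assumptions, the grouping that lands closest to 1.5 per year is the 27 confirmed and unconfirmed fire reports; this is an inference from the arithmetic, not something the reports themselves confirm.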
Interestingly, of the incidents where the time the fire started was provided, only 27 percent started in the PM hours. The other 73 percent started in the morning hours, and all but one of those started between 9:00 AM and 11:00 AM local time. We find this interesting because everyone expects incidents to happen on the graveyard shift, when no one is around. Perhaps this statistic reinforces the need to keep people out of data centers, because things always seem to happen when people are around. While we did not find any contributing factors related to human activities, the inference remains strong. Perhaps when more details are shared in the future, we will be able to establish a statistical correlation.
Another factor we were able to look at was the frequency by year. Curiously, we found that 2011, 2014, 2015, and 2018 all tied with four fire incidents each. More importantly, since 2011 we have never gone more than three years with fewer than four fire incidents. If that holds true, then we are due to have two more fires by the time 2021 ends. Certainly, this is not something to look forward to.
To be realistic about all these facts and figures, the datasets are just not sufficient to be statistically reliable. We need much more data to achieve the degree of confidence that statistical analysis requires and to ensure the data is not skewed. This is what the Data Center Incident Reporting Network (DCIRN) is about. Launched in 2017, DCIRN aims to collate all incidents affecting the reliability and availability of data centers.
Achieving these goals will require the industry to start opening up and sharing common, non-proprietary operational data that provides insights into operational improvements. As data center incidents have more and more impact on people’s daily lives, the industry will have to stand up and contribute to this process or risk intervention by multiple governments with differing reporting requirements.
Leadership by example is a powerful tool, and we need more everyday leaders like Octave Klaba, who can stand up in the face of disaster, muster his team to deliver herculean efforts in restoring services to their clients, and commit to full transparency about the factual causes that led to the total loss of two adjacent data centers to a fire. We will learn much from this incident, but this is only one data center site. The industry can learn so much more by sharing common data across all data centers.