AWS lost an availability zone in Frankfurt for three hours when air circulation systems failed. The normally routine incident escalated when a fire suppression system was triggered and the data center had to be evacuated.
The fire suppression system removed oxygen from the air, so for around an hour staff could not enter the data hall to fix the fault, prolonging the outage. All systems are now operating normally, according to the Amazon Web Services status page. Because only one availability zone was affected, the impact on customers was limited.
Human suppression system
The breakdown began at 13:18 PDT, when connectivity issues and high error rates for EC2 instances began to be reported. The root cause was a system failure that took out air handlers and allowed air temperatures to rise.
"Servers and networking equipment in the affected Availability Zone began to power-off when unsafe temperatures were reached," says Amazon's report on the incident. This became more serious when multiple redundant switches shut down: "A larger number of EC2 instances in this single Availability Zone lost network connectivity."
AWS staff could easily have fixed the air handling problem before any IT services were impacted, if not for one complication, said AWS: "While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone."
This suppression system is meant to activate when it detects smoke, so it should not have been triggered by the facility's raised temperature. But because it went off, the data center was "evacuated and sealed," and a chemical agent was released to displace oxygen - which would have put out any fire, had there been one.
Since a fire alarm had been raised, AWS could not do anything for some time. First, the fire department had to determine the site was safe, and then the site had to be made habitable by humans once more: "In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers."
Once cooling was available again, the servers and switches were turned on, and all instances recovered quickly, except for a very few volumes that were adversely affected, AWS reports. "We continue to work to recover those last affected instances and volumes, and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery of those resources, we recommend replacing any remaining affected instances or volumes if possible."
Meanwhile, that recalcitrant fire suppression system, which should never have discharged without smoke, has been disabled.
"This system will remain inactive until we are able to determine what triggered it improperly," says AWS.
Does this mean the data center now has a slightly higher risk of fire? Not according to AWS, which says "alternate fire suppression measures are being used to protect the data center."
That is an important footnote because, while data center fires are mercifully rare, this year has seen a very serious conflagration at OVHcloud's Strasbourg site, which destroyed two data centers.
Fire suppression systems are clearly necessary, but incidents where they fail and themselves cause outages are all too frequent. In 2017, one such incident caused an Azure outage; in 2018, a fire suppression failure took out a DigiPlex data center and the Nordic Nasdaq, and the State of New Jersey's data center also went down that year in a similar incident.