Facebook has provided more details on its seven-hour outage this Monday, which also brought down Instagram, WhatsApp, and Oculus.
During the outage, company employees were locked out of their offices and unable to use the Facebook-based security login systems, making recovery more difficult.
During routine maintenance of Facebook's global backbone network, "a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally," the company's VP of infrastructure Santosh Janardhan said in a blog post.
"Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command."
The command completely severed the connections between Facebook's data centers and cut them off from the Internet. "And that total loss of connection caused a second issue that made things worse," Janardhan said.
He explained: "One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the Internet via another protocol called the border gateway protocol (BGP)."
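The name-to-address translation Janardhan describes is the ordinary DNS lookup performed before every web request. It can be observed directly in a few lines of Python (using `localhost` here, which resolves locally even without Internet access):

```python
import socket

# A DNS lookup translates a human-readable name into the IP address
# a browser actually connects to. "localhost" resolves via the local
# hosts file, so no network access is needed for this demonstration.
name = "localhost"
address = socket.gethostbyname(name)
print(f"{name} -> {address}")  # -> 127.0.0.1
```

When the authoritative name servers answering such queries cannot be reached, this translation step fails and the site's IP addresses simply cannot be discovered, regardless of whether the servers behind them are healthy.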
Facebook's DNS servers disable their BGP advertisements if they themselves cannot speak to the data centers, since this is an indication of an unhealthy network connection. With this outage, the entire backbone appeared unhealthy, causing the BGP advertisements to be withdrawn.
"The end result was that our DNS servers became unreachable even though they were still operational," Janardhan said. "This made it impossible for the rest of the Internet to find our servers."
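The failure chain, in which operational DNS servers withdrew their own routes because the backbone behind them looked down, can be captured in a toy model. This is a simplified sketch for illustration only; the class, the health-check rule, and all names are invented, not Facebook's actual systems:

```python
# Toy model of the outage's second stage: each DNS server advertises
# its IP via BGP only while it can reach at least one data center.
# A backbone-wide failure therefore withdraws every route, making the
# servers unreachable even though they are all still running.

class DnsServer:
    def __init__(self, name, reachable_datacenters):
        self.name = name
        self.reachable_datacenters = reachable_datacenters
        self.running = True  # the server process itself never crashes

    def advertises_route(self):
        # Health check: keep the BGP advertisement up only if the
        # backbone behind us looks healthy.
        return self.running and len(self.reachable_datacenters) > 0

def resolve(query, dns_servers):
    # The rest of the Internet can only query servers whose IP
    # addresses are still advertised via BGP.
    reachable = [s for s in dns_servers if s.advertises_route()]
    if not reachable:
        return None  # timeout: "impossible to find our servers"
    return f"{query} answered by {reachable[0].name}"

servers = [DnsServer("dns-a", ["dc1", "dc2"]), DnsServer("dns-b", ["dc3"])]
print(resolve("facebook.com", servers))   # answered normally

# The errant command disconnects every data center at once:
for s in servers:
    s.reachable_datacenters = []

print(resolve("facebook.com", servers))   # None: all routes withdrawn
assert all(s.running for s in servers)    # yet every server is still up
```

The last assertion is the crux of Janardhan's point: the DNS servers remained operational throughout; they were merely unfindable.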
This compounding series of failures happened very fast, the company said. Engineers were unable to access Facebook's data centers through normal means because their networks were down, and the total loss of DNS broke many of the internal tools they would normally use to investigate and resolve outages like this.
The company sent engineers to data centers, but it took time to get into the secure facilities with access systems down.
"They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them," Janardhan said. "So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online."
Once the backbone was restored, the services were ready to resume, but the team feared crashes from the coming surge in traffic. Janardhan said: "Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk."
Here, the company followed procedures honed in its previous 'storm' drills on how to recover from a major outage, allowing it to bring the platform slowly back online without incident.
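This style of staged recovery, admitting load in increments and checking system health between steps rather than flipping everything on at once, can be sketched as follows. The step size, power model, and thresholds are invented for illustration and are not from Facebook's runbooks:

```python
# Illustrative sketch of staged traffic restoration: ramp load toward
# a target in fixed steps, holding at the last safe level if a
# power/health check fails. All numbers here are hypothetical.

def restore_in_stages(target_load_mw, step_mw, power_ok):
    """Ramp load from 0 toward target_load_mw in step_mw increments.

    power_ok is a health check called before each increase; if it
    rejects the next level, ramping stops at the last safe level.
    """
    load = 0.0
    while load < target_load_mw:
        next_load = min(load + step_mw, target_load_mw)
        if not power_ok(next_load):
            return load  # hold here rather than risk the surge
        load = next_load
    return load

# Example: suppose the electrical systems tolerate up to 80 MW.
final = restore_in_stages(target_load_mw=60.0, step_mw=10.0,
                          power_ok=lambda mw: mw <= 80.0)
print(final)  # 60.0 - full restoration reached safely
```

The point of the pattern is that a failing check pauses the ramp at a known-good level instead of risking the sudden tens-of-megawatts swing Janardhan describes.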
"We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making," Janardhan said. "I believe a tradeoff like this is worth it - greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this."