Infrastructure engineers at Dropbox have added a “layer” of additional checks to prevent outages like last Friday's and have taken measures to speed up recovery of the service if outages do occur in the future.
 
Following standard practice, the popular online storage service uses multiple production-workload replicas for redundancy. When the replicas fail, they have to be restored from backup.
 
The company's user base has grown so much over the past few years – it now supports hundreds of millions of users – that its MySQL data sets have become too large to restore quickly. Following the outage, its engineers built a tool that parallelizes replay of binary logs (logs of database changes), which they say speeds up the recovery process.
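The article does not say how Dropbox's tool splits up the work, so the following is only a rough sketch of the general idea of parallel log replay. It assumes, purely for illustration, that changes to different databases are independent of each other and that each logged change can be represented as a (database, statement) pair. The names `partition_by_database`, `replay_parallel` and `apply` are hypothetical and are not taken from the Dropbox tool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed shape of a logged change: (database name, statement text).
Event = Tuple[str, str]


def partition_by_database(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Group events by database, keeping the original order inside each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for database, statement in events:
        groups[database].append(statement)
    return dict(groups)


def replay_parallel(
    events: Iterable[Event],
    apply: Callable[[str, str], None],
    workers: int = 4,
) -> None:
    """Replay each database's changes in order, with databases running concurrently."""
    groups = partition_by_database(events)

    def replay_one(item: Tuple[str, List[str]]) -> None:
        database, statements = item
        # Order matters within one database, so this inner loop stays sequential.
        for statement in statements:
            apply(database, statement)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every group and re-raises any worker exception.
        list(pool.map(replay_one, groups.items()))
```

The point of the sketch is the trade-off: ordering is preserved only where it is assumed to matter (within one database), which is what allows the rest of the replay to run side by side.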
 
Dropbox plans to open-source the tool.
 
The outage happened because of a bug that installed a routine server upgrade on several active servers, Aditya Agarwal, Dropbox VP of engineering, wrote in a blog post. The entire service went down as a result.
 
Agarwal assured users that their files were safe throughout the three-hour outage and that there had not been any hacking or Distributed Denial of Service (DDoS) attacks on the service. This was apparently an attempt to address multiple claims of responsibility for the outage by hackers.
 
Agarwal concluded the post by apologizing and describing the company's follow-up work to prevent such incidents in the future: “We’re currently building more tools and checks to make sure this doesn’t happen again.”
 
While the complete Dropbox blackout lasted three hours on Friday evening, the core service was not fully restored until late Sunday afternoon, according to a “post-mortem” on the company's Tech blog by Akhil Gupta, its head of infrastructure.
 
The bug that brought the infrastructure down was in a script that checks whether or not a server has active data before upgrading it. The script ran during a maintenance period scheduled to upgrade operating systems on some servers and reinstalled a few active machines.
 
Dropbox runs on thousands of databases, Gupta wrote. For each database, there is a master server and two redundant “replica” servers. Some of these master-replica groups were affected by the upgrade, which brought the service down.
 
The infrastructure team has now implemented a process where servers verify locally whether they are running production workloads or not before executing commands. “This enables machines that self-identify as running critical processes to refuse potentially destructive operations,” Gupta wrote.
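The kind of local check Gupta describes can be pictured with the short sketch below. It is an illustration only, not Dropbox's code: the marker file path, the `is_production_host` helper, the `run_destructive` wrapper and the `RefusedOperation` error are all hypothetical, and the sketch simply assumes that a host serving live traffic carries a marker file it can inspect for itself.

```python
import os
from typing import Callable

# Hypothetical marker: assume provisioning writes this file on hosts serving live traffic.
PRODUCTION_MARKER = "/etc/role/production"


class RefusedOperation(RuntimeError):
    """Raised when a host declines to run a destructive command."""


def is_production_host(marker_path: str = PRODUCTION_MARKER) -> bool:
    # The host decides from its own local state rather than trusting the caller.
    return os.path.exists(marker_path)


def run_destructive(
    name: str,
    action: Callable[[], None],
    marker_path: str = PRODUCTION_MARKER,
) -> None:
    """Run `action` only if this host does not self-identify as production."""
    if is_production_host(marker_path):
        raise RefusedOperation(f"{name}: host self-identifies as production")
    action()
```

In this arrangement a faulty upgrade script can still pick the wrong machine, but the machine itself gets the last word before anything is reinstalled.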
 
The databases do not contain users' file data, so the outage did not put user files at risk. They support features such as photo album sharing and camera uploads.
 
The service was restored from backups. “We were able to restore most functionality within three hours, but the large size of some of our databases slowed recovery,” Gupta wrote.
 
Media reports citing anonymous sources have said Dropbox is preparing to launch an initial public offering this year.