Amazon Web Services S3 outage due to human error

Amazon has shared more details on what caused the significant AWS outage earlier this week.

On Tuesday morning, at 9:37am Pacific Time, an AWS team member mistyped a command while debugging the S3 billing process and accidentally removed servers supporting crucial subsystems, causing chaos both inside and outside Amazon.

How to kill the Internet

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” Amazon said in a post-mortem.

“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”

One of the subsystems was the index subsystem, which manages the metadata and location information of all S3 objects in the US-EAST-1 region. “This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests,” the company said.

The other subsystem was the placement subsystem, which manages the allocation of new storage and requires the index subsystem to be functioning.

Amazon said: “Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”

The cloud company said that AWS had not completely restarted the index subsystem or the placement subsystem in its major regions for some years. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”

AWS plans to make several changes to its processes, including limiting how quickly capacity can be removed and blocking any removal that would take a subsystem below its minimum required capacity level.
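The two safeguards can be pictured as a pre-flight check that runs before any servers are taken out of service. The following is a minimal sketch of the idea only, not Amazon's tooling: the `Subsystem` type, the `check_removal` function, and the `max_per_step` limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    # All fields are illustrative; real capacity models are far richer.
    name: str             # e.g. "index" or "placement"
    active_servers: int   # servers currently serving requests
    min_servers: int      # floor below which the subsystem cannot operate


def check_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> int:
    """Validate a removal request and return the capacity that would remain.

    Two rules are enforced, mirroring the changes described above:
    1. no single step may remove more than max_per_step servers;
    2. no removal may take the subsystem below its minimum capacity.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > max_per_step:
        raise CapacityRemovalError(
            f"{requested} servers exceeds the per-step limit of {max_per_step}"
        )
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_servers:
        raise CapacityRemovalError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_servers}"
        )
    return remaining
```

With a check like this in front of the removal command, a mistyped input that asks for too many servers is rejected before anything is touched, rather than discovered after the subsystem has lost capacity.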

It added: “We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem.

“As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”

The AWS Service Health Dashboard, which could not update customers on the state of their infrastructure because it depended on Amazon S3, has been changed to run across multiple AWS regions.

It happens

Human error has long been one of the main causes of outages and service disruptions for Internet infrastructure, perhaps second only to squirrels. Employees were to blame for such accidental problems as the Nasdaq going down, Azure crashing, and the majority of European Internet traffic being sent to Hong Kong.

Just last month, a sysadmin at source-code hub GitLab.com accidentally deleted the wrong directory, wiping 300GB of live production data. None of the five backup/replication processes deployed were successful.

On Twitter, Microsoft Technical Fellow Jeffrey Snover offered some solace to those who had made similar mistakes, and gave a helpful tip on how to avoid such calamities:

I’m a Tech Fellow
I once typed a bad cmd deleting most of the OS sources @ Apollo
I added -WHATIF & -CONFIRM in PS to help others
USE THEM! pic.twitter.com/HsG0YTJhBH
— Jeffrey Snover (@jsnover) March 3, 2017
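The same idea carries over to any destructive tool, not just PowerShell. As a rough sketch, assuming a hypothetical `remove_servers` helper that is not part of any real library, a command can default to a dry run that only reports what it would do, and ask for explicit confirmation before acting:

```python
from typing import Callable, Optional, Sequence


def remove_servers(
    fleet: Sequence[str],
    targets: Sequence[str],
    what_if: bool = True,
    confirm: Optional[Callable[[Sequence[str]], bool]] = None,
) -> list:
    """Return the fleet that remains after removing targets.

    what_if defaults to True, so the safe behaviour is the default:
    the function reports the planned change and removes nothing.
    confirm, if given, is called with the targets and must return
    True before anything is removed.
    """
    unknown = [t for t in targets if t not in fleet]
    if unknown:
        raise ValueError(f"not in fleet: {unknown}")

    if what_if:
        # Dry run: describe the effect without changing anything.
        print(
            f"What if: would remove {len(targets)} of {len(fleet)} "
            f"servers: {sorted(targets)}"
        )
        return list(fleet)

    if confirm is not None and not confirm(targets):
        # The operator declined, so leave the fleet untouched.
        return list(fleet)

    return [s for s in fleet if s not in targets]
```

Calling `remove_servers(fleet, targets)` with no other arguments only prints the plan. The operator has to pass `what_if=False` deliberately, which gives a mistyped input one more chance to be noticed.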
