Amazon Web Services has detailed why its US-East-1 cloud region suffered a major outage on 7 December.
AWS said that it uses an internal network to host foundational services including monitoring, internal DNS, authorization services, and parts of the EC2 control plane. This network collapsed due to unexpected behavior by automated systems.
In a post-mortem, the company said that the internal network is connected to the main AWS network by multiple geographically isolated networking devices. These devices provide the additional routing and network address translation that allow AWS services to communicate between the two networks.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network," the company said in the report.
"This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks."
The congestion also broke real-time monitoring, making it hard for the internal operations teams to understand what was happening, which is why employees at the time thought it could be an external attack.
"Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors," the company said. "Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered."
This improved matters, but did not resolve them. It took until 2:22 PM PST for network devices to fully recover after several remedial actions.
"We have taken several actions to prevent a recurrence of this event," Amazon said. "We immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediations. Our systems are scaled adequately so that we do not need to resume these activities in the near-term."
It said that its networking clients have well-tested request back-off behaviors that are designed to allow its systems to recover from these sorts of congestion events, but that this did not happen as "the automated scaling activity triggered a previously unobserved behavior."
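For readers unfamiliar with request back-off, the general idea can be sketched in a few lines of Python. This is a generic illustration of capped exponential back-off with random jitter, not AWS's client code, which the report does not describe. The function names, the default values and the choice of `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound, in seconds, on the wait before retry number `attempt` (0-based).

    The bound doubles with each attempt and is capped at `cap`.
    The default values are illustrative assumptions.
    """
    return min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is used up."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up rather than retry forever
            # Wait a random time below the exponential ceiling so that
            # many clients do not all retry at the same instant.
            sleep(rng.uniform(0, backoff_ceiling(attempt, base, cap)))
```

Each failed attempt roughly doubles the maximum wait before the next one, so a client that keeps failing sends less and less traffic to the congested device. The random jitter spreads retries from many clients over time, where fixed waits would produce synchronized waves of reconnection attempts.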
The company apologized for the outage, and for its lack of communication during the event. It blamed its failure to update the Service Health Dashboard on its inability to use its monitoring systems.
"We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers," the company said.
The outage took out everything from Disney+ to Tinder to Amazon's own warehouse logistics network.