Amazon has experienced more problems at its Northern Virginia data center, with some customers in its US-EAST-1 Region experiencing downtime until 12:32 PDT yesterday.
Cloud application platform Heroku experienced downtime, along with Reddit and Netflix, among others, according to company status reports.
Reddit said the issue “appears to be network-related”.
Services were lost for several hours, according to a blog post on Systems Watch, a site designed to “provide transparency into modern day network computing”.
On its service health dashboard, Amazon said customers of EC2, RDS and CloudSearch in Northern Virginia were experiencing “elevated error rates”.
At 10:38 PDT yesterday, Amazon said it was investigating degraded performance for a number of EBS volumes in an Availability Zone in the region.
An hour later it said it was experiencing connectivity issues and degraded performance for a small number of instances.
It said new launches of its EBS-backed instances were failing, adding to the degraded performance.
By about 2:20 pm it said about half of the instances had been restored.
It appears all is up and running again, but this is not the first time EC2 has experienced problems.
Back in July, Amazon Web Services blamed a major outage on the failure of a back-up generator when the facility experienced problems with power supply.
“Once the power outage hit, triggered by a large-scale electrical storm that had gone through northern Virginia on 29 June, electrical load in one of more than 10 Amazon data centers in the region failed to successfully transfer to generator backup. Uninterruptible power supply (UPS) serving that particular circuit eventually ran out of battery power and servers began losing power,” FOCUS reported in July.
“While promising to take steps to repair the equipment involved, lengthen failover time and expand around-the-clock engineering staff, Amazon said it would not replace the offending generator.”
And in June, another outage took down a number of web services including Heroku and Foursquare. Amazon said this was due to a cable fault in its utility power-distribution system and that a back-up generator in a data center had powered off because of a defective cooling fan.
In its blog post, Systems Watch said many of the customers affected in this most recent outage were architected across multiple Availability Zones, yet were still brought down by a “single-zone issue”.
“One of the most common complaints is that the glue that holds Amazon Web Services together is their management API stack. Some use the literal programmatic API libraries, others maybe AWS Management Console, and some RightScale, however they all are layers built on the AWS API Platform and when that goes all bets are off,” the blog post reads.
“The problem scenario occurs when you have an architected system across two or more AWS availability zones in Virginia (us-east), one of the biggest and oldest regions of AWS, and AWS has an outage in one zone specifically in that region.
“The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region. Our guess is that either through AWS and client automated failover/scale up or due to the outage itself the API stack cannot handle the request load of all those affected, which leaves AWS users in the region unable to fail-over or increase resources to an unaffected zone.”
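The failure mode Systems Watch describes can be sketched in a few lines of Python: every zone in a region shares the same management API endpoint, so when that endpoint is overwhelmed, retries in every zone fail together and no failover can proceed. The function names, the stubbed API and the retry policy below are illustrative assumptions for the sketch, not actual AWS code.

```python
import time

def launch_with_retry(launch_fn, zone, attempts=3, base_delay=0.01):
    """Try a (hypothetical) regional API call with exponential backoff.

    Returns an instance id on success, or None if the API never responds.
    """
    for attempt in range(attempts):
        try:
            return launch_fn(zone)
        except ConnectionError:
            # The API endpoint is shared region-wide, so an overwhelmed
            # API stack blocks the failover path for *every* zone at once.
            time.sleep(base_delay * (2 ** attempt))
    return None

def fail_over(launch_fn, zones):
    """Attempt to bring up replacement capacity in each healthy zone in turn."""
    for zone in zones:
        instance = launch_with_retry(launch_fn, zone)
        if instance is not None:
            return instance
    return None  # regional API down: unaffected zones are unreachable too

# Demo with a stub standing in for the real, region-wide API:
def overloaded_api(zone):
    raise ConnectionError("API stack overwhelmed")

print(fail_over(overloaded_api, ["us-east-1b", "us-east-1c"]))  # prints None
```

The point of the sketch is the shape of the dependency, not the retry arithmetic: a multi-zone architecture only helps if the control plane used to move between zones is itself independent of the failing zone.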
Amazon Web Services has yet to provide more detail on how this outage occurred.