Thursday, April 28, 2011

Outage at Amazon Web Services Caused by Network Configuration Error

In the early hours of April 21st, a configuration error during a network upgrade at Amazon's US East Region data center led to a cascading series of failures that ultimately brought down parts of the Amazon Elastic Compute Cloud (EC2).
The trigger event was a configuration change meant to upgrade the capacity of the facility's primary network.

Instead of temporarily re-routing traffic to a redundant router on the primary network, the configuration change shifted traffic onto a lower-capacity, redundant Amazon Elastic Block Store (EBS) network. The secondary network couldn't handle the traffic load, and many EBS nodes in the affected Availability Zone were completely isolated from the other EBS nodes in their cluster.

Amazon Web Services has published a lengthy memo on the incident. The company has also issued credits to affected customers and apologized for the outage.

Going forward, AWS said it would make it easier for companies to use its geographically distributed Availability Zones, which are completely isolated and independent of each other. The company also plans to invest in better visibility, control, and automation for recovering volumes in the event of another disaster.
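The multi-zone advice can be sketched as a simple placement policy: spread instances round-robin across zones so a single-zone failure affects only a fraction of the fleet. The zone names and helper function below are illustrative assumptions, not part of any AWS API:

```python
from itertools import cycle

def spread_across_zones(instance_ids, zones):
    """Assign instances to Availability Zones round-robin, so a
    single-zone outage takes down at most ceil(n / len(zones)) of them.
    (Hypothetical helper for illustration only.)"""
    zone_cycle = cycle(zones)
    return {instance: next(zone_cycle) for instance in instance_ids}

# Example: four instances spread over three illustrative us-east zones
placement = spread_across_zones(
    ["i-1", "i-2", "i-3", "i-4"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
```

Here a failure of any one zone leaves at least two of the four instances running, which is the kind of resilience AWS is encouraging customers to design for.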
