The original trigger was a leeeetle mistake when reconfiguring network connectivity for some planned work. Primary network traffic was redirected to a network with inadequate capacity, resulting in the servers losing the vital network connections they need to remain in synch as part of a cluster. This in turn triggered the servers to try to re-synch, which exacerbated the network performance constraint until the house of cards fell.
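The feedback loop described above can be sketched as a toy simulation. All numbers and names here are hypothetical illustrations, not Amazon's actual topology: the point is just that once re-mirroring traffic from disconnected nodes exceeds the link's capacity, more nodes drop off, which generates more re-mirroring traffic, and the cascade runs away.

```python
def simulate_storm(nodes=100, link_capacity=50, base_traffic=1,
                   remirror_traffic=5, steps=10):
    """Toy model of a re-mirroring storm on an undersized network link.

    Returns the count of disconnected nodes after each step. All
    parameters are made-up units chosen only to show the dynamics.
    """
    disconnected = 0
    history = []
    for _ in range(steps):
        # Connected nodes send baseline traffic; each disconnected node
        # adds heavier re-mirroring load as it tries to re-sync.
        load = (nodes - disconnected) * base_traffic \
             + disconnected * remirror_traffic
        if load > link_capacity:
            # Overload knocks further nodes off the cluster network,
            # roughly in proportion to how far over capacity we are.
            overflow = load - link_capacity
            disconnected = min(nodes,
                               disconnected + max(1, overflow // remirror_traffic))
        history.append(disconnected)
    return history

# With the defaults, the disconnected count only ever grows, and within a
# few steps every node is off the cluster network: the house of cards falls.
print(simulate_storm())
```

The instructive feature is that the loop never recovers on its own: the recovery mechanism (re-syncing) is itself the load that keeps the link saturated, which is why such incidents typically need operator intervention to break the cycle.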
It caught my eye that Amazon's cloud-based relational database service was impacted by the incident:
"In addition to the direct effect this EBS issue had on EC2 instances, it also impacted the Relational Database Service (“RDS”). RDS depends upon EBS for database and log storage, and as a result a portion of the RDS databases hosted in the primary affected Availability Zone became inaccessible."

It also caught my eye just how much resilience is built into Amazon's cloud architecture - and yet all that technical brilliance was foiled by a config error, presumably just an unfortunate typo by some network operator having A Bad Day (been there, done that!). These things happen, but designing architectures and processes to be resilient to such operator or indeed user errors is at least as challenging as taming the technology.
Finally, the level of detail in the post-incident report published by Amazon is telling. It is, I suspect, a somewhat sanitized version of a more detailed internal technical report. It describes a complex sequence of events that someone has had to reconstruct from the system logs, alarms and alerts, and no doubt a confession by a red-faced network op. It's an elegant example of the value of forensics. Thank you, Amazon, for sharing it with us. [If this level of humility and graphic detail from suppliers turns out to be characteristic of information security incidents affecting cloud services, then cloud security has just gone up a notch in my estimation.]