design-patterns, architecture, distributed-computing, cloud-hosting

What can we learn about building distributed systems from the recent Amazon EC2 outage?


Now that the dust has settled, what can we learn about building distributed systems from the recent Amazon EC2 and Amazon RDS Service Outage?


Solution

  • Thanks for the interesting links. Obviously every distributed system is different and every outage is unique, so it is difficult to generalise. Some takeaways I have are:

    1. Outages happen to even the best guys on the block...so you'd better plan for yours.

    2. Building distributed systems is hard...so you need experience and experienced friends.

    3. Manual changes are a common cause...not said explicitly in the AWS writeup, but strongly implied.

    4. Outages are often "emergent" phenomena whereby a simple error causes many systems to interact in a way that grows exponentially. The AWS writeup refers to this as a "storm", and I have witnessed similar "storms" in large distributed systems. The degree of coupling and simple aspects like backoff parameters can make the difference between a disturbance that grows exponentially and one that decays exponentially (see the retry sketch after this list). Think of the Tacoma Narrows bridge - perhaps the analogy is a stretch, but tuning a few simple parameters can avoid destructive resonances.

    5. The Netflix Chaos Monkey is interesting. The "Lean" guys have taught us that if something is difficult (like testing or deployment) then you should do it often until it ain't difficult any more. Perhaps system failure/resilience is the next frontier for this approach; a rough sketch of the idea also follows below.
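
To make the backoff point in item 4 concrete, here is a minimal sketch of a client-side retry loop with capped exponential backoff and full jitter, in Python. The function, its parameters and the commented-out `flaky_service` call are illustrative assumptions, not anything taken from the AWS writeup:

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `operation`, sleeping between attempts with capped exponential
    backoff plus full jitter.

    A fleet of clients retrying aggressively and in lock-step is exactly the
    kind of "storm" described above; the cap and the jitter spread the retries
    out so a transient failure decays instead of resonating.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # The retry window grows exponentially but is capped; the actual
            # sleep is drawn at random from it so clients desynchronise.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))


# Hypothetical usage against a flaky dependency:
# call_with_backoff(lambda: flaky_service.describe_volume("vol-123"))
```

The design choice worth noting is the jitter: plain exponential backoff without it still lets every client hammer a recovering service at the same instants.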
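
For item 5, a rough sketch of the general idea behind a "chaos" step: pick a random instance from a non-critical fleet and kill it on a schedule, so recovery paths get exercised routinely. This is an assumption about the approach, not Netflix's actual Chaos Monkey; the fleet list, the `terminate` callable and the instance ids are all hypothetical:

```python
import random


def chaos_step(instances, terminate, dry_run=True):
    """Terminate one randomly chosen instance from `instances`.

    `instances` is a list of instance identifiers and `terminate` is whatever
    function actually kills one (for example, a call to your cloud API);
    both are placeholders in this sketch.
    """
    if not instances:
        return None
    victim = random.choice(instances)
    if dry_run:
        print(f"[chaos] would terminate {victim}")
    else:
        terminate(victim)
    return victim


# Run on a schedule during working hours, dry-run first:
# chaos_step(["i-0abc123", "i-0def456"], terminate=my_terminate_fn, dry_run=True)
```

Starting with `dry_run=True` and a fleet you can afford to lose is the low-risk way to build the habit before pointing it at anything important.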