web-services testing infrastructure

Writing a 'Chaos Monkey' to increase resilience


Apologies for the rather open nature of the question, but I think it's a very valuable area of discussion.

Following the recent AWS outage and the huge number of horror stories that followed it, I was really impressed by the Chaos Monkey 'technique' applied by Netflix (one of the few to survive pretty much without a scratch).

For those who don't know the concept, it is essentially a little bot that goes around your infrastructure, causing chaos along the way, as a way of continuously testing resilience.
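
To make the idea concrete, here is a minimal sketch in POSIX shell of what such a bot might do on each run: pick one target at random and run a disruptive command against it. It is only an illustration, not Netflix's implementation; `pick_victim`, `unleash_monkey` and the target names are hypothetical, and the disruptive command is whatever you supply.

```shell
# pick_victim TARGET...: print one randomly chosen target (hypothetical helper).
pick_victim() {
    [ "$#" -gt 0 ] || return 1
    # awk's srand() seeds from the clock in whole seconds, so two calls in the
    # same second pick the same index. Good enough for a sketch.
    index=$(awk -v n="$#" 'BEGIN { srand(); print int(rand() * n) }')
    shift "$index"
    printf '%s\n' "$1"
}

# unleash_monkey KILL_CMD TARGET...: run KILL_CMD against one random target.
# KILL_CMD is an assumption: any command or function that takes a target name.
unleash_monkey() {
    kill_cmd=$1
    shift
    victim=$(pick_victim "$@") || return 1
    echo "chaos monkey chose: $victim"
    "$kill_cmd" "$victim"
}
```

A dry run such as `unleash_monkey echo web-1 web-2 db-1` only prints the chosen target; replacing `echo` with a function that actually stops a service or instance is where the real testing starts.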

Besides Jeff Atwood's Chaos Monkey post, I've been able to find little on this being employed anywhere else.

Whilst I appreciate that good test-driven development is a solid foundation, I think that this would be a great addition to the arsenal of any company/organisation that wants to stay up.

  • Has anyone else approached this topic before?
  • Are there particular areas other than connectivity and security vulnerabilities that you would see such a piece of code hitting?
  • Any other thoughts/feelings on this approach?

Solution

  • There are several tests you could run to stress your system. I like to use ApacheBench (ab) to load test a page that writes to the database. I vary both the total number of requests and the number of concurrent users.

    500 concurrent users making a total of 5000 requests
    $ ab -n 5000 -c 500 url

    I know my web server can stand up to this, but I found a problem with how I was logging information. You could point it at different aspects of your site.

    If you use caching, you could clear the cache in the middle of the test to check that everything recovers quickly.

    If you can replicate your server in a VM, try changing the amount of RAM, unmounting a hard disk, running out of disk space, disconnecting a network interface, etc.

    You could try to brute-force a password and make sure your system allows only n login attempts before rate limiting that user.
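
The login check can be scripted in the same spirit as the `ab` run. This is a rough sketch, assuming the application answers HTTP 429 once a user is rate limited; `attempts_until_blocked` and `try_login` are hypothetical helpers, and `LOGIN_URL` and the form field names are placeholders to adjust for your own site.

```shell
# attempts_until_blocked MAX CMD [ARGS...]
# Runs CMD up to MAX times and prints how many runs succeeded before the
# first failure. CMD must exit non-zero once the server blocks the user.
attempts_until_blocked() {
    max=$1
    shift
    tries=0
    while [ "$tries" -lt "$max" ]; do
        "$@" || break
        tries=$((tries + 1))
    done
    echo "$tries"
}

# Hypothetical probe: succeeds while the server still evaluates the password,
# fails once it answers 429. LOGIN_URL, the field names and the 429 status
# are assumptions about the application under test.
try_login() {
    status=$(curl -s -o /dev/null -w '%{http_code}' \
        -d 'username=testuser' -d 'password=wrong-guess' "$LOGIN_URL")
    [ "$status" != "429" ]
}

# Example (not run here): LOGIN_URL=... attempts_until_blocked 50 try_login
```

If the printed count equals the maximum you passed in, the server never pushed back within that many attempts, which is the failure you are looking for.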