amazon-web-services, google-chrome, amazon-route53, failover

AWS Route53 failover and Chrome DNS cache


I have a pretty standard failover setup with Route53:

  • Two regions
  • Primary (associated with a health check) and Secondary failover records for each region.
  • The records point to the APIs. We also have a front-end JS application that uses the API.

If Primary is unhealthy, DNS returns the Secondary record. This works perfectly as long as the user is not using the application at the moment of failure.

So:

  • If Primary is unhealthy and a user opens the app after failover has been activated, everything is OK (DNS points to the Secondary record).

  • If Primary becomes unhealthy while a user is using the application, the application keeps trying to reach the old IP address, which is unavailable, so it does not switch to the Secondary record.

It seems the DNS result is cached (for Chrome this can be checked at chrome://net-internals/#dns). The user can continue using the app only after a period of inactivity, during which the API is not called and Chrome's DNS cache entry expires.

Is there a workaround for this particular case, where Primary becomes unhealthy while a user is using the application? Or how can we make the user experience more pleasant in this case?

Added example:

  • User 1 is using the app (an Ember.js app).
  • Primary goes down and failover is activated.
  • After that, User 2 accesses the app (failover is active) and Route53 returns the Secondary record, so everything is OK.
  • Meanwhile, User 1 is still using the app, which keeps making requests to the API, but those requests go to the old IP from Chrome's DNS cache.

Added:

We are using Alias records (the TTL of A records on Route53 is always 60 seconds).


Solution

  • It all comes down to TTL. If you set the TTL on your record to 30 seconds, the browser should re-resolve the address roughly every 30 seconds, which should be acceptable for most cases. Of course, that comes at the cost of a bit of extra latency and slightly higher cost from the additional DNS queries (though Route53 is really cheap). If you need a shorter TTL, you can set one.

    If you want even more control, you would have to set up your own load balancer that routes to a different region when your machine goes down. That will not save you when EC2 itself fails, though it might buy you enough time to spin up a new instance.
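
To make the TTL trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the worst-case time a client keeps hitting the failed Primary is about the health check detection time (check interval times failure threshold) plus one full TTL. The function name and the example values (30 s or 10 s check interval, threshold of 3) are illustrative assumptions, not taken from the setup in the question, and the estimate ignores caches that do not honor the TTL.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                ttl_s: int) -> int:
    """Rough upper bound on how long a client may keep using the failed Primary.

    Hypothetical helper for illustration only; it is not part of any AWS API.
    """
    # Time for the health check to see enough consecutive failures
    # to mark the endpoint unhealthy.
    detection_s = check_interval_s * failure_threshold
    # A resolver or browser that cached the old answer just before the
    # switch may keep serving it for up to one full TTL.
    return detection_s + ttl_s


# Assumed values: 30 s check interval, 3 failures, 60 s TTL.
print(worst_case_failover_seconds(30, 3, 60))  # 150
# Assumed values: 10 s check interval, 3 failures, 30 s TTL.
print(worst_case_failover_seconds(10, 3, 30))  # 60
```

Under these assumptions, lowering the TTL only shrinks one term of the sum, so the health check interval and failure threshold matter just as much for a user who is active at the moment of failure.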