amazon-web-services amazon-ec2 amazon-rds high-availability aws-vpc

Common AWS failures - Handling AZ failover


Specifically, my question is: what is the recommended way to organize AZ failover in an AWS environment? It would also help to understand typical AWS failures in order to design application HA (High Availability). The application architecture (in terms of AWS services) is a more or less typical web application setup on AWS:

  1. Route 53 resolves the IP of an ELB.
  2. A public subnet contains the ELB, which routes traffic to the web servers in a private subnet;
  3. In the private subnet traffic flows: Web Servers -> ELB -> Application Servers;
  4. The application servers write data to a Multi-AZ RDS instance.

The main drawback of this deployment is that services are active in only one AZ, because in a Multi-AZ deployment Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. So the master is in only one AZ, and services in the other AZ are not allowed to write to RDS because the instance there is a standby.

Two questions:

  1. What is the best way to implement HA for such a deployment?
  2. What are the common AWS failures? When an AZ is unavailable, does that usually affect only some services (e.g. VPC, EC2, EBS or others), or are all services in that AZ typically unavailable?

Considerations about HA for such approach:

  1. RDS. From the AWS docs: "In the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another Availability Zone if you have enabled Multi-AZ. The time it takes .....". So AWS will switch the RDS master automatically.
  2. Active/inactive AZ. Health checks can be added to Route 53 to make the other AZ active. But how can this be synchronized with RDS, so that the other AZ becomes active only after the RDS instance there has become master?

Update: Another reason to keep one active and one passive AZ is that our application servers must support stickiness by device IP address (i.e. the session is kept based on the user's or device's IP). We have one EC2 web server instance in each AZ that maintains this, so we can't allow requests to go to different AZs.


Solution

  • I think you misunderstand how availability zones work. Services in one AZ can connect to the RDS master in a different AZ. You should have all services running in at least 2 AZs.

    For RDS, when the master fails or the AZ the master is in goes down, the RDS service will promote the standby to master and update the DNS for the RDS endpoint so that the endpoint then points to the new master.

    All your code needs to do in order to handle an RDS failover is to gracefully handle sudden DB disconnects with a retry.
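
The retry described above could look roughly like the following sketch. It is only an illustration under some assumptions: `TransientDBError` is a hypothetical stand-in for whatever disconnect exception your database driver raises, and `connect` and `operation` are hypothetical callables you would supply. The important parts are that each attempt opens a fresh connection (by the RDS endpoint hostname, so the updated DNS record can be picked up) and that the wait between attempts grows.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for the driver's 'connection lost' error."""


def run_with_retry(connect, operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(conn), reconnecting and retrying on a disconnect.

    connect   -- callable returning a new connection (assumed to connect
                 by the RDS endpoint hostname, not a cached IP)
    operation -- callable taking the connection and doing the DB work
    """
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()          # fresh connection on every attempt
            return operation(conn)
        except TransientDBError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With a real driver you would catch that driver's own disconnect exception instead. Retrying is only safe if `operation` can be repeated without harm, for example a read or a write wrapped in a transaction that is known not to have committed.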