java amazon-web-services redis amazon-elasticache

Auto failover handling in cluster mode disabled Redis ElastiCache


I want to understand failover from the perspective of node endpoints (https://forums.aws.amazon.com/) and IP addresses in the following cases:

  1. The master fails over and a replica gets promoted
  2. A replica fails over

The configuration is cluster mode disabled: a single shard (1 master and 2 replicas) with Multi-AZ enabled. For example:

PRIMARY ENDPOINT - xxx.dktrm8.ng.0001.usw2.cache.amazonaws.com
READER ENDPOINT - xxx-ro.dktrm8.ng.0001.usw2.cache.amazonaws.com
NODE ENDPOINTS - {
xxx-a.dktrm8.0001.usw2.cache.amazonaws.com -> master,
xxx-b.dktrm8.0001.usw2.cache.amazonaws.com -> replica,
xxx-c.dktrm8.0001.usw2.cache.amazonaws.com -> replica
}

Questions:

  1. Are the node endpoints DNS names?
  2. Is failover handled at the IP address level or at the node endpoint level?
  3. After a failover, can the primary endpoint point to a different node endpoint (the promoted master), or does only the IP address mapping change?
  4. If one uses node endpoints for read traffic instead of the reader endpoint, is it possible for a node endpoint's role to change to 'MASTER' during a failover or maintenance?

Solution

  • First, yes, those are DNS names that will answer A queries, as are the master/replica GSLBs. (I don't think they're really GSLBs in the sense you'd use for, say, a web application, but they ensure the primary node is always behind the master endpoint and the replicas are always behind the replica endpoint.)

    Secondly, since the actual node names (as opposed to the pseudo-GSLBs) resolve directly to IP addresses, it doesn't matter which of the two levels you think of failover as happening at.

    After a failover, both the master/primary and replica GSLB endpoints will update. The master/primary endpoint will point to the replica that was promoted to primary. The replica endpoint will temporarily contain only one replica, the one that wasn't promoted. As soon as the original master/primary node comes back online, it will be reconfigured as a replica and then added to the replica GSLB endpoint. This assumes there are 3 nodes in total. If there are more, the replica endpoint will have one node fewer until the original primary comes back online.

    Finally, yes, a node's role can change, so you should always use the GSLB endpoints, as they always contain the most up-to-date primary and replicas. If you connect to the nodes directly, you run the risk of writing to a primary that has become a replica, reading from a replica that has become the primary, or talking to a node that is simply offline. Unless all three nodes (or more, if you run more) are offline, the primary and replica endpoints will always point to the right place.
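
If you do connect to node endpoints directly, it may help to verify each node's current role and to make sure the JVM re-resolves DNS after a failover. The following is only a minimal Java sketch under some assumptions: the class and method names (`EndpointCheck`, `useShortDnsCache`, `resolve`, `roleFromInfo`) are made up for illustration, the 5-second TTL is an arbitrary value, and the input to `roleFromInfo` is assumed to be the text reply of the Redis `INFO replication` command fetched with whatever client you already use.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Arrays;

// Hypothetical helper class, not part of any AWS or Redis client library.
final class EndpointCheck {

    private EndpointCheck() {
    }

    // Ask the JVM to cache successful DNS lookups for only a few seconds,
    // so that an updated endpoint record is picked up after a failover.
    // This should be called early, before the first name lookup is made.
    static void useShortDnsCache(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    // Resolve an endpoint name to the IP addresses it currently maps to.
    static String[] resolve(String host) throws UnknownHostException {
        return Arrays.stream(InetAddress.getAllByName(host))
                .map(InetAddress::getHostAddress)
                .toArray(String[]::new);
    }

    // Extract the role from the text reply of `INFO replication`.
    // Redis reports "master" for a primary and "slave" for a replica.
    // Returns "unknown" if the reply has no role line.
    static String roleFromInfo(String infoReplication) {
        for (String line : infoReplication.split("\\r?\\n")) {
            if (line.startsWith("role:")) {
                return line.substring("role:".length()).trim();
            }
        }
        return "unknown";
    }
}
```

For example, calling `EndpointCheck.useShortDnsCache(5)` at startup and then checking `EndpointCheck.roleFromInfo(reply)` before sending writes to a node endpoint would catch a node that has been demoted. Using the primary and reader endpoints avoids the need for this check altogether.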