amazon-web-services, routes, dns, amazon-eks, nlb

AWS DNS resolution across AZs


Here's the setup: an HTTP request travels from a client connected to the Client VPN (NAT) to a private hosted zone in Route 53, where the A record resolves to a Network Load Balancer (NLB) DNS name. The NLB forwards the traffic to the EKS nodes via their AWS DNS names.

The EKS cluster is deployed across 2 of the 3 possible Availability Zones - let's call them 1, 2 and 3, per their assigned subnet CIDR ranges.

The issue that occurs, according to Wireshark, is this:

  • The Client VPN private IP requests the A record from the hosted zone, which returns the private addresses of the LB in all 3 AZs - success.

  • The client then opens a TCP connection to the address in subnet 1. This times out.

  • The client then opens a TCP connection to the address in subnet 2, which succeeds, and the site loads.

If the TCP requests are sent to subnets 1 and 3 first, the site does not load.
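The behavior above can be reproduced without Wireshark by resolving the record and attempting a TCP connection to each returned address separately. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the hostname and port you pass in are whatever your private hosted zone record and listener use.

```python
import socket


def resolve_a_records(hostname, port):
    """Return the sorted, de-duplicated IPv4 addresses the name resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP handshake to ip:port completes within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


def probe_all(hostname, port, timeout=3.0):
    """Map every resolved address to whether a TCP connection succeeded."""
    return {
        ip: can_connect(ip, port, timeout)
        for ip in resolve_a_records(hostname, port)
    }
```

Calling `probe_all` with the record name and listener port from inside the VPN should show which of the per-AZ addresses accept connections and which time out, which can then be matched against the subnet CIDR ranges.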

At the same time, if the site is requested via a public hosted zone and an internet-facing LB, the site loads without issue regardless of the AZ.

For the life of me I can't figure out why this round-robinesque behavior is happening, but more importantly, why can't the site be reached through the other Availability Zones?

What I've tried:

Recreated the LB with only 2 AZs - this only cuts the site load time in half, it doesn't solve the problem.

Checked security groups - once inside the VPN, everything is accessible on the private subnet.

Checked routes - There are routes from the VPN endpoint to all 3 AZ subnets.

As per https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html, internal load balancers don't support loopback or hairpinning. AWS advises using an internet-facing LB instead, but that won't give out a private IP, and the A records won't resolve via the public IP. AWS also advises registering targets by IP rather than by instance ID, which won't work for me because the private IPs of the EKS nodes will change in the future as EKS gets upgraded.


Solution

  • Based on the comments.

    The issue was caused by cross-zone load balancing being disabled. It is off by default.

    Enabling cross-zone load balancing solved the problem.
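With cross-zone load balancing disabled, each NLB node forwards traffic only to targets registered in its own Availability Zone, so an address in a zone without healthy targets accepts nothing. As a rough sketch of how the fix could be applied programmatically, the helper below sets the `load_balancing.cross_zone.enabled` attribute through an ELBv2 client. The function name is hypothetical, and the client is assumed to be something like `boto3.client("elbv2")`; it is passed in as an argument so the snippet does not depend on boto3 being installed.

```python
# Attribute key used by Elastic Load Balancing v2 for cross-zone balancing.
CROSS_ZONE_KEY = "load_balancing.cross_zone.enabled"


def enable_cross_zone(elbv2_client, load_balancer_arn):
    """Turn on cross-zone load balancing for one load balancer.

    elbv2_client is assumed to expose modify_load_balancer_attributes,
    as the boto3 "elbv2" client does. Returns the attributes reported
    back by the call as a plain {key: value} dict.
    """
    response = elbv2_client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        # Attribute values are passed as strings, not booleans.
        Attributes=[{"Key": CROSS_ZONE_KEY, "Value": "true"}],
    )
    return {attr["Key"]: attr["Value"] for attr in response.get("Attributes", [])}
```

With boto3 installed and credentials configured, a call would look like `enable_cross_zone(boto3.client("elbv2"), arn)`, where `arn` is the ARN of the internal NLB. If the NLB is created and managed by a Kubernetes Service, a change made directly on the load balancer may be overwritten, so the setting may need to be applied through the Service definition instead.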