Nginx sometimes returns 502 Bad Gateway when the upstream is an Amazon ELB

An ELB scales up and down dynamically, so its hostname may resolve to a different set of IPs at different times. Nginx caches the IPs of its upstream targets so that it doesn't have to resolve the hostname again and again. But when the IPs behind an upstream ELB change (i.e. some old IPs are no longer part of the ELB), Nginx keeps forwarding traffic to an old IP that no longer has any target attached. Whatever now sits behind that IP (a VM or something else) answers with 502 Bad Gateway, and Nginx passes the same 502 status on to the clients. We suspect this happens because Nginx does not honor the TTL of the DNS records.
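A minimal sketch of the kind of configuration where this tends to show up, for illustration only. The ELB hostname and the upstream name `backend` are hypothetical. With a literal hostname like this, Nginx looks up the name when it loads the configuration and keeps using those addresses until the next reload or restart.

```nginx
upstream backend {
    # Hypothetical ELB hostname. Nginx resolves it once, when the
    # configuration is loaded, and reuses the resulting IPs afterwards.
    server my-elb-1234567890.us-east-1.elb.amazonaws.com:80;
}

server {
    listen 80;

    location / {
        # Requests go to the IPs cached above, even if the ELB has
        # since moved to different addresses.
        proxy_pass http://backend;
    }
}
```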

Has anyone faced a similar issue? If so, what fix did you try?

Solution

This is a known problem that many others have run into. It can be solved by making Nginx re-resolve the ELB hostname instead of holding on to the cached IPs, so that it notices when the IPs for the upstream ELB change. The resolver directive in Nginx can be used for this. References: https://gc-taylor.com/blog/2011/11/10/nginx-aws-elb-name-resolution-resolvers , https://distinctplace.com/2017/04/19/nginx-resolver-explained/
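A minimal sketch of the resolver approach, under a few assumptions: the ELB hostname is hypothetical, `169.254.169.253` is assumed to be the Amazon-provided DNS server reachable from the instance, and `valid=30s` is only an illustrative cache lifetime. The important part is that the hostname is placed in a variable, which makes Nginx resolve it at request time through the configured resolver rather than once at startup.

```nginx
server {
    listen 80;

    # Assumption: Amazon-provided DNS server inside the VPC.
    # valid=30s overrides the record TTL; the value is illustrative.
    resolver 169.254.169.253 valid=30s;

    location / {
        # Hypothetical ELB hostname. Because proxy_pass uses a variable,
        # Nginx resolves the name at runtime via the resolver above and
        # picks up new ELB IPs once the cached answer expires.
        set $elb_upstream "my-elb-1234567890.us-east-1.elb.amazonaws.com";
        proxy_pass http://$elb_upstream;
    }
}
```

If `valid` is left out, Nginx caches answers for the TTL of the DNS response. Test any change like this against your own setup before rolling it out, since a variable in proxy_pass also changes how the request URI is passed when a URI part is given.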