Why is my NGINX to pm2 upstream slow when restarting?

I run a home server with nginx as a reverse proxy in front of a Node.js/PM2 upstream. Normally it works perfectly. However, when I make changes I run pm2 reload pname or pm2 restart pname, and nginx then returns 502 Bad Gateway for about 10-20 seconds before it finds the new upstream.

My Node.js app starts very fast and I am 99% sure it is not actually taking that long for the upstream to start and bind to the port (when I don't use the nginx layer it is accessible instantly). How can I eliminate the extra time it takes for nginx to figure things out?

From nginx/error.log:

2021/01/29 17:50:35 [error] 18462#0: *85 no live upstreams while connecting to upstream, client: [ip], server: hostname.com, request: "GET /path HTTP/1.1", upstream: "http://localhost/path", host: "www.hostname.com"

From my nginx domain config:

server {
        listen 80;
        server_name hostname.com www.hostname.com;
        return 301 https://$host$request_uri;
}

server {
        listen 443 ssl;
        server_name hostname.com www.hostname.com;
        # ...removed ssl stuff...
        gzip_types      text/plain text/css text/xml application/json application/javascript application/xml+rss application/atom+xml image/svg+xml;
        gzip_proxied    no-cache no-store private expired auth;
        gzip_min_length 1000;
        location /  {
                proxy_pass    http://localhost:3010;
                proxy_http_version 1.1;
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection 'upgrade';
                proxy_set_header Host $host;
                proxy_cache_bypass $http_upgrade;
                proxy_set_header X-Forwarded-For $remote_addr;
                proxy_read_timeout 240s;
        }
}

Solution

This is caused by the default behavior of an upstream, which may not be obvious because you are not declaring one explicitly with the upstream directive. Written with an upstream directive, your configuration would look like this:

upstream backend {
        server localhost:3010;
}

...

server {
        listen 443 ssl;
        ...
        location /  {
                proxy_pass    http://backend;
                ...
        }
}

In this form it's apparent that you're relying on the default options of the server directive. The server directive has many options, but two of them matter here: max_fails and fail_timeout. These control when nginx considers a server failed and how it reacts. By default max_fails=1 and fail_timeout=10 seconds. This means that after one unsuccessful attempt to communicate with the upstream, nginx marks it unavailable and waits 10 seconds before trying it again, which is what the "no live upstreams" error in your log reflects.
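
For illustration, here is the same upstream with those two defaults written out. This is only a sketch to make the implicit values visible, assuming the documented defaults apply to your nginx build; it should behave the same as the bare server line above:

upstream backend {
        # Equivalent to "server localhost:3010;" with the defaults spelled out
        server localhost:3010 max_fails=1 fail_timeout=10s;
}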

To avoid this in your environment you could simply disable this mechanism by setting max_fails=0:

upstream backend {
        server localhost:3010 max_fails=0;
}
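
If you would rather keep failure detection, for example because you might add more backends later, a possible middle ground is to shorten the window instead of disabling the mechanism. This is a minimal sketch; the values 3 and 1s are illustrative assumptions, not tuned recommendations:

upstream backend {
        # Mark the server unavailable only after 3 failed attempts within 1s,
        # and try it again after 1s instead of the default 10s
        server localhost:3010 max_fails=3 fail_timeout=1s;
}

With either variant, check the configuration with nginx -t and reload nginx (for example with nginx -s reload) so the change takes effect.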