nginx, load-balancing, nginx-config

Why does NGINX load balancer passive health check not detect when upstream server is offline?


I have an upstream block in an Nginx config file. This block lists the backend servers across which requests should be load balanced.

...
upstream backend {
    server backend1.com;
    server backend2.com;
    server backend3.com;
}
...

Each of the above 3 backend servers is running a Node application.

  1. If I stop the application process on backend1, Nginx recognises this via its passive health check, and traffic is directed only to backend2 and backend3, as expected.
  2. However, if I power down the server on which backend1 is hosted, Nginx does not recognise that it is offline and keeps trying to direct traffic/requests to it, resulting in a 504 error.

Can someone shed some light on why this (scenario 2 above) may happen, and whether there is some further configuration that I am missing?

Update: I'm beginning to wonder if the behaviour I'm seeing is because the above upstream block is located within an http {} Nginx context. If backend1 is indeed powered down, the failure is a connection error, so (maybe off the mark here, but just thinking aloud) should this be a TCP health check?
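For reference, this is roughly what I mean by checking at the TCP layer: the upstream would live in a stream {} context rather than http {}. A minimal sketch only; the listen port and backend port here are illustrative, not my real values:

stream {
    upstream backends_tcp {
        server backend1.com:3000;
        server backend2.com:3000;
        server backend3.com:3000;
    }

    server {
        # plain TCP (layer 4) load balancing, no HTTP semantics
        listen 3000;
        proxy_pass backends_tcp;
        # fail over quickly if a backend cannot be reached
        proxy_connect_timeout 2s;
    }
}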

Update 2:

nginx.conf

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections 768;
    # multi_accept on;
}

http {


    upstream backends {
        server xx.xx.xx.37:3000 fail_timeout=2s;
        server xx.xx.xx.52:3000 fail_timeout=2s;
        server xx.xx.xx.69:3000 fail_timeout=2s;
    }

    ##
    # Basic Settings
    ##

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    # server_tokens off;

    # server_names_hash_bucket_size 64;
    # server_name_in_redirect off;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # SSL Settings
    ##
    ssl_certificate     …
    ssl_certificate_key …
    ssl_ciphers         …;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    ##
    # Logging Settings
    ##

    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    ##
    # Gzip Settings
    ##

    gzip on;

    # gzip_vary on;
    # gzip_proxied any;
    # gzip_comp_level 6;
    # gzip_buffers 16 8k;
    # gzip_http_version 1.1;
    # gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

    ##
    # Virtual Host Configs
    ##

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}

default

server {
    listen 80;
    listen [::]:80;
    return 301 https://$host$request_uri;
    #server_name ...;
}
server {

    listen              443 ssl;
    listen              [::]:443 ssl;
    # SSL configuration
    ...
    # Add index.php to the list if you are using PHP
    index index.html index.htm;

    server_name _;

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        try_files $uri $uri/ /index.html;
        #try_files $uri $uri/ =404;
    }

    location /api {
        rewrite /api/(.*) /$1 break;
        proxy_pass http://backends;
        proxy_redirect     off;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Host $server_name;
    }

    # Requests for socket.io are passed on to Node on port 3000
    location /socket.io/ {
        proxy_http_version 1.1;

        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        proxy_pass http://backends;
    }
}

Solution

  • The reason you get a 504 is that when nginx checks the backend over HTTP it tries to connect to the location you configured (e.g. / expecting a 200 status code). Since backend1 is powered down, the port is not listening and the socket never opens.

    It takes some time for that connection attempt to hit a timeout, and hence the 504 Gateway Timeout.

    It's a different case when you stop the application process: the host is still up, so the closed port returns "connection refused", which is detected pretty quickly and the instance is marked as unavailable.

    To overcome this you can set fail_timeout=2s to control how the server is marked as unavailable; the default is 10 seconds.

    https://nginx.org/en/docs/http/ngx_http_upstream_module.html#fail_timeout
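As a rough sketch building on that answer (the addresses are the same placeholders from the question, and the directives are standard upstream / proxy options): fail_timeout works together with max_fails, while proxy_connect_timeout and proxy_next_upstream control how long nginx waits for an unreachable backend before retrying the request on the next server:

upstream backends {
    # after 1 failed attempt, consider the server unavailable for 2s
    server xx.xx.xx.37:3000 max_fails=1 fail_timeout=2s;
    server xx.xx.xx.52:3000 max_fails=1 fail_timeout=2s;
    server xx.xx.xx.69:3000 max_fails=1 fail_timeout=2s;
}

server {
    ...
    location /api {
        proxy_pass http://backends;
        # don't wait the default 60s to establish a connection
        # to a powered-down host
        proxy_connect_timeout 2s;
        # on a connection error or timeout, retry on the next server
        proxy_next_upstream error timeout;
    }
}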