
Upstream times out with Nginx, Thin/Rails while reading response header from upstream


I'm running Nginx to proxy requests to two Thin servers. The site works about 90% of the time, but every once in a while it becomes unresponsive and I get errors like this:

2014/11/28 21:40:05 [error] 21516#0: *1458 upstream timed out (110: Connection timed out) while reading response header from upstream, client: X.X.X.X, server: www...com, request: "HEAD / HTTP/1.1", upstream: "http://127.0.0.1:5001/", host: "www.example.com", referrer: "http://www.example.com/"

I searched for solutions online, but unfortunately most of them address the case where this error happens on every request, which usually means Thin simply isn't running. In my case the site works most of the time. I can ping the Thin servers themselves, although I haven't tried pinging them while the site is unresponsive; that might give me some more insight into the problem.
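In case it helps with debugging intermittent failures like this: a small probe script (my own sketch, not part of the original question) can be run on the nginx host while the site is unresponsive, to distinguish "Thin is down" from "Thin is up but slow to respond". The host and port values assume the upstream block shown below.

```ruby
require "socket"

# Return true if a TCP connection to the upstream can be established
# within `timeout` seconds; false on refusal or timeout.
def upstream_up?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout).close
  true
rescue SystemCallError, IOError
  false
end

# Probe both Thin backends from the nginx host.
[5000, 5001].each do |port|
  puts "127.0.0.1:#{port} #{upstream_up?('127.0.0.1', port) ? 'up' : 'DOWN'}"
end
```

Running this from cron every minute and logging the result would show whether the Thin processes are still accepting connections at the moment nginx reports the upstream timeout.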

Here is my nginx.conf:

user www-data;
worker_processes 2;
pid /var/run/nginx.pid;
events {
  worker_connections 768;
  multi_accept on;
}
http {

  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 70;
  types_hash_max_size 2048;

  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  client_max_body_size 100M;

  access_log /var/log/nginx/access.log;
  error_log /var/log/nginx/error.log;

  gzip on;
  gzip_disable "msie6";

  gzip_static on;

  gzip_comp_level 6;

  gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript application/octet-stream;

  ssl_session_cache shared:SSL:10m;
  ssl_session_timeout 10m;

  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}

And sites-enabled/default

map $http_upgrade $connection_upgrade {
 default Upgrade;
 ''      close;
}
upstream example {
  server 127.0.0.1:5000;
  server 127.0.0.1:5001;
}
upstream websocket {
  server 127.0.0.1:5001;
}
server {
  listen 80;
  listen 443 ssl;
  keepalive_timeout 70;
  root /data/example/;
  index index.html index.htm;
  server_name www.example.com;
  ssl_certificate     <PATH>;
  ssl_certificate_key <PATH>;

  location ~ ^/assets/ {
    include /etc/nginx/block.list;
    expires 1d; 
    add_header Cache-Control public;
    add_header ETag ""; 
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header  X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_pass http://example;
  }

  location /websocket {
    include /etc/nginx/block.list;
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header  X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;
    proxy_redirect off;

    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_pass http://websocket;
  }

  location / {
    include /etc/nginx/block.list;
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header  X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;
    proxy_redirect off;

    proxy_pass http://example;
  }
}

The last thing I did was remove some if statements from the config file, shown below. I'm not sure this did anything; the problem hasn't happened since then, but I don't think enough time has passed to tell.

if (-f $request_filename/index.html) {
   rewrite (.*) $1/index.html break;
}
if (-f $request_filename.html) {
    rewrite (.*) $1.html break;
}
set $flags "";
if (!-f $request_filename) {
    set $flags "${flags}R";
}
if ($flags = "R") {
    proxy_pass http://example;
    break;
}

EDIT: The problem did return. I'm back to square one.
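As an aside, the nginx documentation discourages if inside location blocks; the same file-or-proxy fallback can usually be expressed with try_files and a named location. A rough, untested sketch of an equivalent (paths and names are illustrative, adjust to your setup):

```nginx
location / {
  # Try the static file, then a directory index, then file.html;
  # fall through to the Thin upstream if none exist on disk.
  try_files $uri $uri/index.html $uri.html @thin;
}

location @thin {
  include /etc/nginx/block.list;
  proxy_set_header Host $http_host;
  proxy_pass http://example;
}
```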

Solution

  • The solution turned out to be simple, though the issue took me longer to figure out than it should have.

    It turns out Google Compute Engine has a firewall rule that disconnects idle TCP connections after 10 minutes. This meant Thin's connection to the database was being dropped.

    However, Thin was not raising a timeout error, which made it hard to trace the source of the Nginx timeout. Maybe this is a bug in Thin; I'm not sure, but I did have a timeout parameter set in the Thin configuration for a short period.

    The fix is to configure keepalive settings so that TCP connections stay alive even when idle. See here for details: https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet.
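For reference, keepalive can also be enabled per socket from Ruby; this is my own sketch (not from the original answer), assuming a Linux host where the TCP_KEEPIDLE constants are available. The 60-second values are illustrative; the point is simply to probe well under the 10-minute idle cutoff.

```ruby
require "socket"

sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM)

# Turn on TCP keepalive for this socket.
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)

# Linux-specific tuning: start sending keepalive probes after 60s of
# idleness, well under the 10-minute idle-connection cutoff.
if defined?(Socket::TCP_KEEPIDLE)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, 60)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, 60)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, 5)
end
```

Alternatively, keepalive can be tuned system-wide on the host running Thin via sysctl (e.g. net.ipv4.tcp_keepalive_time on Linux), which is the approach the linked Google document describes.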