amazon-web-services amazon-elastic-beanstalk elastic-load-balancer

Spurious 504s from AWS classic load balancer

I have a REST API running on AWS Elastic Beanstalk that works well most of the time. However, every few hours it hiccups by returning a 504 on a single request. Here's the entry from the AWS Elastic Load Balancer (classic) log:

2018-03-04T21:07:00.151327Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.216:57324 - -1 -1 -1 504 0 2497 0 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-SHA TLSv1
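For anyone reading along, the entry above follows the Classic Load Balancer access log layout: timestamp, load balancer name, client, backend, three processing times, the load balancer status code, the backend status code, received bytes, sent bytes, then the quoted request and user agent. A backend of `-` with processing times of `-1` means the load balancer recorded no backend for the request. Below is a minimal Python sketch for picking such entries out of a log; the names `ElbEntry`, `parse_entry`, and `never_dispatched` are made up for illustration and are not part of any AWS tooling.

```python
import shlex
from typing import NamedTuple, Optional

# The 504 entry quoted above, used here as sample input.
SAMPLE = (
    '2018-03-04T21:07:00.151327Z awseb-e-x-AWSEBLoa-abc123 '
    'xxx.xxx.xxx.216:57324 - -1 -1 -1 504 0 2497 0 '
    '"POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" '
    'ECDHE-RSA-AES128-SHA TLSv1'
)


class ElbEntry(NamedTuple):
    """Hypothetical container for the leading fields of one log entry."""
    timestamp: str
    elb: str
    client: str
    backend: Optional[str]   # None when the log shows "-"
    request_time: float
    backend_time: float
    response_time: float
    elb_status: int
    backend_status: int
    received_bytes: int
    sent_bytes: int
    request: str


def parse_entry(line: str) -> ElbEntry:
    # shlex.split keeps the double-quoted request string as one field.
    f = shlex.split(line)
    return ElbEntry(
        timestamp=f[0],
        elb=f[1],
        client=f[2],
        backend=None if f[3] == "-" else f[3],
        request_time=float(f[4]),
        backend_time=float(f[5]),
        response_time=float(f[6]),
        elb_status=int(f[7]),
        backend_status=int(f[8]),
        received_bytes=int(f[9]),
        sent_bytes=int(f[10]),
        request=f[11],
    )


def never_dispatched(entry: ElbEntry) -> bool:
    # No backend address and a negative backend time: the load balancer
    # logged the request without a backend response.
    return entry.backend is None and entry.backend_time < 0


if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    print(entry.elb_status, entry.backend, never_dispatched(entry))
```

Running `parse_entry` over each line of a log file and filtering with `never_dispatched` gives a quick list of the requests that failed this way, which makes it easier to see how often they occur.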

Here's the log in context:

2018-03-04T21:07:54.884768Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.216:57339 xxx.xxx.xxx.85:80 0.000041 0.134478 0.000084 200 200 2672 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-SHA TLSv1
2018-03-04T21:07:55.935722Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.216:57342 xxx.xxx.xxx.85:80 0.000067 0.107369 0.000075 200 200 5538 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-SHA TLSv1
2018-03-04T21:07:56.633812Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.226:33815 xxx.xxx.xxx.85:80 0.000041 0.149562 0.000079 200 200 332 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2018-03-04T21:07:00.151327Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.216:57324 - -1 -1 -1 504 0 2497 0 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-SHA TLSv1
2018-03-04T21:08:00.521384Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.226:45505 xxx.xxx.xxx.85:80 0.000037 0.172259 0.000072 200 200 334 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2018-03-04T21:08:02.896099Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.226:55647 xxx.xxx.xxx.112:80 0.000041 0.166058 0.000064 200 200 334 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2018-03-04T21:08:08.914958Z awseb-e-x-AWSEBLoa-abc123 xxx.xxx.xxx.226:10771 xxx.xxx.xxx.85:80 0.000046 0.173661 0.000091 200 200 341 93 "POST https://my.api.com:443/v1/data/add HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

The logs on the EC2 instances themselves show no corresponding loss of health.

The problem seems to go away for a couple of days after I rebuild the underlying EC2 instances.

Solution

The culprit was a bad proxy config in my .ebextensions directory, specifically the excessively high values for proxy_connect_timeout, proxy_send_timeout, and proxy_read_timeout:

files:
  "/etc/nginx/conf.d/proxy.conf" :
    mode: "000644"
    owner: root
    group: root
    content: |
      client_max_body_size 500m;
      proxy_buffers 8 16k;
      proxy_buffer_size 32k;
      proxy_connect_timeout 1800s;
      proxy_send_timeout 1800s;
      proxy_read_timeout 1800s;
      gzip on;
      gzip_vary on;
      gzip_proxied any;
      gzip_comp_level 4;
      gzip_static on;
      gzip_http_version 1.1;
      gzip_min_length 256;
      gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript application/javascript application/vnd.ms-fontobject application/x-font-ttf font/opentype image/svg+xml image/x-icon;

I had originally set these in an attempt to allow large uploads, but that was a mistake. Removing the three timeout settings and redeploying worked like a charm.