Tags: node.js, nginx, terraform, amazon-ecs, terraform-provider-aws

Terraform-created AWS ECS infra: health check keeps failing


In short, I want to deploy my Nginx and Node.js Docker images to AWS ECS, and I'm using Terraform to build the infrastructure. However, the task running on the server keeps failing, and I get 503 Service Temporarily Unavailable when accessing my domain bb-diner-api-https.shaungc.com.

(You can see my entire project repo here, but I'll embed links below and walk you through specific related files.)

After terraform apply, it reports 15 resources created, and I can see the service and task running in the ECS web console. However, the task always fails after a while, because the health check always fails.

For Node.js, I get exit code 137, which means the container was killed by a signal (137 = 128 + 9, i.e. SIGKILL). So I believe Node.js is not the cause: nginx failed too many health checks, and the task was then stopped, terminating Node.js along with it. For nginx, nothing shows up at all when I click View logs in CloudWatch (I did set up awslogs in the task definition).


My health check setting

Task definition container health check

Basically, I prepared a route in nginx just for the health check. In the task definition's container_definitions (JSON format), I have a health check on the nginx container like this: "command": ["CMD-SHELL","curl -f http://localhost/health-check || exit 1"], and in my nginx.conf I have:

...
server {
  listen 80;
  ...

  # Dedicated health-check route; see https://serverfault.com/questions/518220/nginx-solution-for-aws-amazon-elb-health-checks-return-200-without-if
  location /health-check {
    # access_log off;
    return 200 "I'm healthy!";
  }
}

So I really don't know why the task is failing health checks.
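
For reference, here is a minimal sketch of how the container-level health check described above might be declared through Terraform. The resource name, family, image, and memory value are hypothetical, and the interval, timeout, retries, and startPeriod numbers are only illustrative. The command is executed inside the nginx container, so one thing worth verifying (an assumption, not something confirmed in this question) is that curl is actually installed in the image.

```hcl
# Sketch only: names, image, and numeric values are hypothetical.
resource "aws_ecs_task_definition" "api" {
  family = "bb-diner-api" # hypothetical family name

  container_definitions = jsonencode([
    {
      name      = "nginx"
      image     = "example/nginx:latest" # hypothetical image; it must ship curl
      essential = true
      memory    = 128

      portMappings = [
        {
          containerPort = 80
          protocol      = "tcp"
        }
      ]

      healthCheck = {
        # Same command as in the question; it runs inside the nginx container,
        # so "localhost" here refers to the container itself.
        command     = ["CMD-SHELL", "curl -f http://localhost/health-check || exit 1"]
        interval    = 30 # seconds between checks
        timeout     = 5  # seconds before a single check counts as failed
        retries     = 3  # consecutive failures before the container is unhealthy
        startPeriod = 10 # grace period in seconds while nginx starts up
      }
    }
  ])
}
```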

Target Group health check for load balancer

I also created an Application Load Balancer so that I can point my Route 53 domain name at it. I noticed that there is another place that performs health checks: the load balancer's target group. The check fails here too, and my instance's status is draining.
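
To make the target group check concrete, the following is a rough sketch of a target group whose health check hits the same /health-check route. It assumes the ALB check is pointed at that route, which the question does not state; the resource name, the vpc_id variable, and the threshold values are hypothetical.

```hcl
# Sketch only: the resource name, variable, and thresholds are hypothetical.
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "api" {
  name     = "bb-diner-api" # hypothetical name
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health-check" # assumes the nginx route from the question
    port                = "traffic-port"  # check the port the target receives traffic on
    protocol            = "HTTP"
    matcher             = "200"           # only HTTP 200 counts as healthy
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```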


Security Group

I think I opened all the possible ports.


So Why Does the Health Check Fail & What Else Is Missing?

Many articles point out that the Nginx configuration, the port, or inbound restrictions (security group/target group) on AWS are common causes, and I have looked at all of them. I let nginx listen on port 80, set the container port to 80, and allowed a wide range of inbound ports in the security group. What else could I be missing?


Solution

  • I more or less figured it out myself. While I never got the container-level health check to pass, I managed to fix the health check failure on the Application Load Balancer.

    Problem & Cause

    It turned out to be related to the security group of the EC2 instance. I noticed this while following an AWS troubleshooting page for health check failures, which advises you to SSH into the instance and try curl -v ... on the instance directly. The curl failed, and I found that my EC2 instance was using the default security group (sg). While the default sg allows all inbound traffic, it limits the source to itself, i.e. the default security group. This can be confusing, but it means that only resources that are also associated with the default security group can send traffic to the instance. Anything outside that group is blocked, so I couldn't reach the instance via my domain name, and neither could the ALB health checker.
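
    As an illustration of why the default sg behaves this way, its rules can be written out in Terraform roughly as follows. This is a sketch of what a VPC's default security group typically looks like, not the configuration from my repo; the vpc_id variable is hypothetical.

    ```hcl
    # Sketch only: mirrors the usual rules of a VPC's default security group.
    variable "vpc_id" {
      type = string
    }

    resource "aws_default_security_group" "default" {
      vpc_id = var.vpc_id

      ingress {
        protocol  = "-1" # all protocols
        from_port = 0
        to_port   = 0
        self      = true # source is limited to members of this same group
      }

      egress {
        protocol    = "-1"
        from_port   = 0
        to_port     = 0
        cidr_blocks = ["0.0.0.0/0"] # outbound traffic is unrestricted
      }
    }
    ```

    The self = true ingress rule is the part that matters here: traffic from a load balancer in a different security group never matches it.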

    Solution

    My final solution is to have a dedicated security group for the ALB, and a new security group for the EC2 instances that only allows traffic from the ALB's security group. Also note that since the ALB's security group already limits ports to 80 and 443, and the EC2 instance sg now sits behind the ALB's sg (all internal traffic), there's no need to restrict the EC2 instance sg to ports 80/443. You can leave the port as 0 to allow all ports (in Terraform, from_port and to_port of 0 mean "all ports" only when protocol is "-1"). If you limit it to the wrong port, the health check will start failing. See the following from the AWS troubleshooting page:

    1. Confirm that the security group associated with your container instance allows all ingress traffic on the ephemeral port range (typically ports 32768-65535) from the security group associated with your load balancer

    Important: If you declare the host port in your task definition, the service will be exposed on the specified port rather than in the ephemeral port range. For this reason, be sure that your security group reflects the specified host port instead of the ephemeral port range.
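
    The two-security-group setup described above can be sketched in Terraform roughly like this. The resource names and the vpc_id variable are hypothetical, and this is a simplified outline rather than the exact code from my repo.

    ```hcl
    # Sketch only: names and the variable are hypothetical.
    variable "vpc_id" {
      type = string
    }

    # Public-facing security group for the ALB: only 80 and 443 from anywhere.
    resource "aws_security_group" "alb" {
      name   = "alb-sg"
      vpc_id = var.vpc_id

      ingress {
        protocol    = "tcp"
        from_port   = 80
        to_port     = 80
        cidr_blocks = ["0.0.0.0/0"]
      }

      ingress {
        protocol    = "tcp"
        from_port   = 443
        to_port     = 443
        cidr_blocks = ["0.0.0.0/0"]
      }

      egress {
        protocol    = "-1"
        from_port   = 0
        to_port     = 0
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

    # Security group for the ECS container instances: any port, but only
    # from members of the ALB's security group.
    resource "aws_security_group" "ecs_instances" {
      name   = "ecs-instances-sg"
      vpc_id = var.vpc_id

      ingress {
        protocol        = "-1" # all protocols, so port 0-0 means all ports
        from_port       = 0
        to_port         = 0
        security_groups = [aws_security_group.alb.id]
      }

      egress {
        protocol    = "-1"
        from_port   = 0
        to_port     = 0
        cidr_blocks = ["0.0.0.0/0"]
      }
    }
    ```

    Allowing every port from the ALB's sg covers both a fixed host port and the ephemeral port range mentioned in the quoted guidance.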


    Other Concerns

    This took me a lot of time and effort to figure out. A small side note: I still can't get the container-level health check, which is defined in the ECS task definition, to work. I tried to SSH into the container instance (EC2 instance), and it turns out that localhost does not work there. Even the AWS troubleshooting page uses an IP address obtained from docker inspect when testing with curl on the EC2 instance directly. But then, for the task definition's container health check, if I'm not checking localhost, what should I check? Should I run docker inspect in the health check command as well to obtain the IP address first? This problem remains unsolved; for now I just use exit 0 to bypass the health check. If anyone knows the correct way to configure this, feel free to share; I'd really like to know as well.