amazon-web-services docker nginx docker-compose amazon-ecs

Nginx container on ECS not following rolling update


I'm so close to finding a nice setup with Docker Compose and ECS, but there is one small thing remaining.

The scenario goes like this:

  1. Update app (Django) source code and deploy to ECS using Docker Compose and Docker Context.
  2. ECS registers a new task for the app and starts it along with the old one.
  3. Problem: nginx does health checks on the old container, and when that container is deregistered, nginx starts throwing 502 errors and restarts the task, which causes downtime.
  4. Nginx starts up again and does health checks on the new container, app working again, but with undesired downtime as mentioned.

Is there some config I need to do here? Am I missing something?

docker-compose.yml for reference:

services:
  web:
    image: # Image from ECR, built from GH-action.
    command: gunicorn core.wsgi:application --bind 0.0.0.0:8000
    environment:
      # ...
    volumes:
      # ...
    deploy:
      replicas: 1

  nginx:
    image: # Image from ECR, kept static
    ports:
      - "80:80"
    volumes:
      # ...
    depends_on:
      - web
    deploy:
      replicas: 1

Solution

  • So it turns out this was quite the challenge. The bottom line is that an ECS task cannot be updated while it is running, so we need to either restart the task or use ECS Exec (`aws ecs execute-command`) to update it manually.

    I tried the jwilder/nginx-proxy approach, but this was not possible with Fargate because of the way volume mounting works with that launch type.

    I ended up using the sidecar pattern for my nginx container. However, the Compose-CLI currently has no built-in support for sidecars (see https://github.com/docker/compose-cli/issues/1566), so I had to use the x-aws-cloudformation overlay in a slightly messy way.

    So first we just remove the nginx service:

    version: "3.9"
    
    services:
      web:
        image: # django-app
        command: gunicorn core.wsgi:application --bind 0.0.0.0:8000
        environment:
          # ...
        volumes:
          # ...
        ports:
          - "80:80" # Move ports into this service so we get the ALB 
        deploy:
          replicas: 1
    

    Run the convert command to get the generated CloudFormation template:

    docker compose convert > cfn.yml
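
Before writing the overlay, it helps to know which logical resource names the template actually contains (for example WebTaskDefinition, WebService and WebTCP80TargetGroup). The sketch below lists them. `list_resources` is a hypothetical helper, not part of the Compose CLI, and it assumes the template is YAML with a top-level `Resources:` key and two-space indentation:

```shell
# list_resources FILE: print the logical IDs found directly under the
# top-level "Resources:" key of a YAML CloudFormation template.
# Assumption: resources are indented by exactly two spaces.
list_resources() {
  awk '
    /^Resources:[ \t]*$/ { in_res = 1; next }   # entering the Resources section
    /^[^ \t#]/           { in_res = 0 }         # any other top-level key ends it
    in_res && /^  [A-Za-z0-9]+:[ \t]*$/ {
      sub(/:.*/, "", $1)                        # drop the trailing colon
      print $1
    }
  ' "$1"
}
```

Usage: `list_resources cfn.yml`. The names it prints are the ones to use as keys under `Resources` in the x-aws-cloudformation overlay.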
    

    Then add the x-aws-cloudformation overlay:

    x-aws-cloudformation:
      Resources:
        WebTaskDefinition:
          Properties:
            ContainerDefinitions:
              # Generated container definitions, copy/paste from cfn.yml
              # Only change ContainerPort for web:
              - # ...Web_ResolvConf_InitContainer
              - # ...Web-container
                PortMappings:
                  - ContainerPort: 8000
    
              # The nginx sidecar:
              - DependsOn:
                  - Condition: SUCCESS
                    ContainerName: Web_ResolvConf_InitContainer
                  - Condition: START
                    ContainerName: web
                Essential: true
                Image: # nginx
                LogConfiguration:
                  # ...
                MountPoints:
                  # ...
                Name: nginx
                PortMappings:
                  - ContainerPort: 80
        
        # We also need to tell the load-balancer to reference the nginx container
        WebService:
          Properties:
            LoadBalancers:
              - ContainerName: nginx
                ContainerPort: 80
                TargetGroupArn:
                  Ref: WebTCP80TargetGroup
    

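    Finally, we need to change the nginx config a bit. Containers in the same Fargate task share a network namespace, so nginx has to reach gunicorn on a local address rather than through the `web` service name: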

    # BEFORE
    upstream app_server {
        server web:8000 fail_timeout=0;
    }
    
    # AFTER
    upstream app_server {
        server 0.0.0.0:8000 fail_timeout=0;
    }
    

    Not pretty, but it works. Rolling updates will now have zero downtime as expected. Hopefully, this pattern will be improved with the evolution of the Compose-CLI!
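
One way to check the zero-downtime claim is to poll the load balancer while a deployment is rolling out and count failed requests. This is only a rough sketch: `is_ok` and `poll` are hypothetical helper names, and the URL you pass in (your load balancer's address) is not something taken from the setup above.

```shell
# is_ok STATUS: succeed for 2xx and 3xx HTTP status codes, fail otherwise.
is_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# poll URL COUNT: send COUNT requests about one second apart, log every
# failed request to stderr, and print the number of failures to stdout.
poll() {
  url="$1"; count="$2"; failures=0; i=0
  while [ "$i" -lt "$count" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || status="000"
    if ! is_ok "$status"; then
      failures=$((failures + 1))
      echo "$(date +%T) got HTTP $status" >&2
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$failures"
}

# Only start polling when a URL is passed on the command line.
if [ "$#" -ge 1 ]; then
  poll "$1" "${2:-300}"
fi
```

Start it in one terminal with the load balancer URL as the first argument, run the deployment in another, and compare the failure count before and after moving nginx into the sidecar.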