Tags: amazon-ecs, aws-fargate, health-check

Pattern for replacing running tasks


We are trying to minimise downtime on a service in ECS. The ECS service has a task with a simple container that has a container health check defined (no ALB in this case). Sometimes the service receives updated config via an internal mechanism, at which point the task needs to be replaced with a new one that will read the new config. We implemented this by having the existing task start to fail its health check when it notices new config, assuming that ECS would detect this and replace the task.

The replacement does happen; however, the failing task is shut down before the replacement is healthy.

Example order of events:

  • 12:50:31 - Service gets new config alert, starts responding with 503 on health check
  • 12:50:59 - ECS detects the failing health check and sends SIGKILL to the task
  • 12:51:10 - ECS starts a new task to replace it
  • 12:51:37 - New task running.

With this pattern we have a period of downtime between 12:50:59 and 12:51:37. Is there a way to tell ECS to start the new task and wait for it to become healthy before terminating the unhealthy one?

Desired order of events:

  • Service gets new config alert, starts responding with 503 on health check
  • ECS starts a new task to replace it
  • New task healthy
  • ECS kills old task

The service is configured with minimum healthy percent = 100% and maximum percent = 200%.
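
In the service definition these map to the deploymentConfiguration block, roughly:

 "deploymentConfiguration": {
   "minimumHealthyPercent": 100,
   "maximumPercent": 200
 }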

Container health check:

 "healthCheck": {
   "startPeriod": 20,
   "retries": 2,
   "command": [
     "CMD-SHELL",
     "curl -f http://localhost:8080/sys/health-check || exit 1"
   ],
   "timeout": 5,
   "interval": 10
 }
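
(For reference: with interval 10 and retries 2, it takes on the order of 20-30 seconds of failed checks before ECS marks the container unhealthy and acts on it, which lines up with the gap between 12:50:31 and 12:50:59 above.)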

Solution

  • It's shutting down the old task earlier than you want because the task is failing its health checks, so as far as ECS is concerned it can no longer handle requests and gets removed immediately. Tricking ECS into redeploying by deliberately failing the health check isn't really the cleanest way to trigger a new deployment.

    The ideal way to trigger a new deployment is to run the CLI command aws ecs update-service --force-new-deployment (or the equivalent API call in one of the AWS SDKs). This tells ECS that you want to replace the existing tasks with new tasks, honouring the minimum healthy percent and maximum percent deployment settings configured on the ECS service.
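
    A minimal sketch of the equivalent call with the Python SDK (boto3); the cluster and service names here are placeholders to substitute with your own:

      import boto3

      ecs = boto3.client("ecs")

      # Force a rolling replacement of the service's tasks without
      # changing the task definition.
      ecs.update_service(
          cluster="my-cluster",   # placeholder
          service="my-service",   # placeholder
          forceNewDeployment=True,
      )

    With minimum healthy percent 100% and maximum percent 200%, ECS should start the replacement tasks first, wait for their container health checks to pass, and only then stop the old tasks, which is the desired order of events described above.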

    I would suggest changing your internal config-update mechanism to run that command after the config changes. Alternatively, you could have a Lambda function that is triggered by the config change and calls the ECS API to start a new service deployment.
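
    As a sketch of that Lambda, building on the call shown above and assuming the cluster and service names are provided via environment variables on the function (names here are illustrative):

      import os
      import boto3

      ecs = boto3.client("ecs")

      def handler(event, context):
          # Triggered by the config-change notification (e.g. an SNS or
          # EventBridge event); the payload itself isn't needed, we just
          # force a new rolling deployment of the service.
          ecs.update_service(
              cluster=os.environ["ECS_CLUSTER"],
              service=os.environ["ECS_SERVICE"],
              forceNewDeployment=True,
          )
          return {"status": "deployment started"}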