
AWS ECS long running jobs scale in


I have long-running jobs running in containers on ECS. When autoscaling triggers a scale-in, is there a way to tell ECS not to kill the task for X amount of time (or until an event is triggered), so the job can finish and only then terminate?

Let's say I have 10 containers on 10 instances, and 3 of them are running jobs right now. I would like ECS not to terminate those 3 instances and to consider only the 7 remaining instances for the scale-in. Is such a thing supported in ECS?


Solution

  • There is no way to signal to ECS that this specific task out of the ten should be skipped when it comes to scaling in.

    But if your goal is to avoid interrupting a running task and give it a chance to finish, you can use basic Unix signal handling. When ECS scales in, it tells the Docker daemon to stop your container, and the Docker daemon sends a SIGTERM signal to the process running in your container. If the process has not exited when the stop timeout elapses (30 seconds by default), the daemon follows up with SIGKILL, so the remaining work has to fit within that window.

    Most language runtimes provide a way to trap this signal and add custom handling. Without a custom handler, the default behavior is for your process to stop immediately, but if you trap SIGTERM you can finish your work first and then exit.

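    For example, here is how you do it in Node.js:

    // Assumes `server` is an existing http.Server created elsewhere.
    process.on('SIGTERM', function () {
      // Stop accepting new connections; the callback runs once open ones have ended.
      server.close(function () {
        process.exit(0);
      });
    });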
    

    This code keeps the process alive until the server has closed all connections; only then does the process exit.
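
    The same idea can be applied to a job worker rather than an HTTP server. The following is only a minimal sketch: `createWorker`, `installSigtermHandler`, `fetchJob`, and `runJob` are hypothetical names that do not come from ECS or any library. It assumes jobs are pulled one at a time and that `fetchJob` resolves to `null` when there is nothing left to do.

```javascript
// Sketch only: createWorker, fetchJob and runJob are hypothetical names.
function createWorker(fetchJob, runJob) {
  let shuttingDown = false; // flipped once SIGTERM arrives
  const completed = [];

  // Pulls jobs one at a time; the flag is only checked between jobs,
  // so a job that has started is always allowed to finish.
  async function run() {
    while (!shuttingDown) {
      const job = await fetchJob();
      if (job === null) break; // nothing left to do
      await runJob(job);
      completed.push(job);
    }
    return completed;
  }

  function requestShutdown() {
    shuttingDown = true;
  }

  return { run, requestShutdown };
}

// Wires the worker to the signal that Docker sends on stop.
function installSigtermHandler(worker) {
  process.on('SIGTERM', () => worker.requestShutdown());
}
```

    A caller would create the worker, call `installSigtermHandler(worker)`, and then run something like `worker.run().then(() => process.exit(0))`. The in-flight job still has to complete before the container is forcibly killed.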

    Alternatively, if your business process takes a really long time, you may be better served by something like AWS Batch (a higher-level service built on top of Amazon ECS, designed for running a pool of long-running tasks on a cluster in response to trigger events).