amazon-web-services autoscaling amazon-ecs

ECS ASG scaling down policy recommendations


We're using AWS ECS to run our services as a fleet of containers. The EC2 instances that run the Docker and ECS agents are in an auto-scaling group whose size is driven by custom metrics, so that the cluster as a whole always has enough available memory to start a few containers at once, but not so much that we pay for idle capacity.

There are no issues when scaling up, but scaling down based on available memory means that a server with running containers can be removed (and its containers arbitrarily killed). This wasn't a problem until recently, because each critical service ran at least two tasks, so one task could be stopped and restarted elsewhere without any service interruption.

But we now have services (Jenkins + remote slaves) that should not be interrupted: an interruption can cut the slave -> master connection and make a running build fail.

I have a few ideas for handling that, but I'm wondering whether there are recommendations, AWS options or a clever way to let an ECS cluster scale down while avoiding casualties...


Solution

  • OK, I gained control over the scale-in operations by toggling the instances' scale-in protection.

    Each ECS instance regularly runs a script that returns metrics. In this script, I added the following part:

    # Instance is idle but still protected: let the ASG remove it.
    if [ "$noContainersRunning" = 1 ] && [ "$asgProtection" = true ]; then
        aws autoscaling set-instance-protection --region "$region" --instance-ids "$instanceId" --auto-scaling-group-name "$asgId" --no-protected-from-scale-in
        echo "Disabling scale-in protection"
    # Instance runs containers but is unprotected: shield it from scale-in.
    elif [ "$noContainersRunning" = 0 ] && [ "$asgProtection" = false ]; then
        aws autoscaling set-instance-protection --region "$region" --instance-ids "$instanceId" --auto-scaling-group-name "$asgId" --protected-from-scale-in
        echo "Enabling scale-in protection"
    fi
    

    (where noContainersRunning is 1 if no tasks are running on this instance and 0 otherwise, and asgProtection is the current state of scale-in protection on this instance, true or false)

    Consequently, the auto-scaling group can't remove an instance that still has a running container. If every instance runs at least one container, the desired count will go down, but the scaling activity will be reported as cancelled ("could not scale") until an instance is back to 0 running containers.

    It's working quite well!
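
    For reference, here is one way the two inputs and the decision could be wired together. It is a minimal sketch, not the original metrics script: the function names are made up for illustration, and it assumes that the ECS agent's introspection endpoint on localhost:51678 is reachable from the instance, that jq is installed, and that $region and $instanceId are already set as in the snippet above.

    ```shell
    # Hypothetical helpers; none of these names come from the original script.

    # Prints 1 if the local ECS agent reports no RUNNING task, else 0.
    # Assumption: agent introspection API on localhost:51678, jq installed.
    no_containers_running() {
        running=$(curl -s http://localhost:51678/v1/tasks |
            jq '[.Tasks[] | select(.KnownStatus == "RUNNING")] | length')
        if [ "$running" -eq 0 ]; then echo 1; else echo 0; fi
    }

    # Prints "true" or "false": the instance's current scale-in protection.
    # JSON output is requested so the boolean comes back in lowercase.
    asg_protection() {
        aws autoscaling describe-auto-scaling-instances \
            --region "$region" --instance-ids "$instanceId" \
            --query 'AutoScalingInstances[0].ProtectedFromScaleIn' --output json
    }

    # Pure decision step, no AWS call: prints "unprotect", "protect" or "none".
    #   $1: 1 if no task runs on the instance, 0 otherwise
    #   $2: current protection state, "true" or "false"
    protection_action() {
        if [ "$1" = 1 ] && [ "$2" = true ]; then
            echo unprotect
        elif [ "$1" = 0 ] && [ "$2" = false ]; then
            echo protect
        else
            echo none
        fi
    }
    ```

    In the metrics script this could be called as action=$(protection_action "$(no_containers_running)" "$(asg_protection)"), with the matching set-instance-protection command run for the protect and unprotect cases. Keeping the decision separate from the AWS calls makes it possible to check the logic without touching a real cluster.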

    I also recreated the services with the bin-pack task placement template, so that containers are packed onto as few instances as possible instead of being spread across all the instances of the cluster.
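
    From the command line, a bin-pack placement can be requested when the service is created. The sketch below is only an illustration: the function name and its four arguments are placeholders, and packing on memory is an assumption made because the cluster here scales on available memory (cpu is the other field binpack accepts).

    ```shell
    # Hypothetical wrapper around "aws ecs create-service".
    #   $1: cluster name   $2: service name
    #   $3: task definition (family:revision)   $4: desired task count
    # With DRY_RUN set to a non-empty value the command is printed, not run.
    create_binpack_service() {
        ${DRY_RUN:+echo} aws ecs create-service \
            --cluster "$1" \
            --service-name "$2" \
            --task-definition "$3" \
            --desired-count "$4" \
            --placement-strategy type=binpack,field=memory
    }
    ```

    Running it as DRY_RUN=1 create_binpack_service my-cluster jenkins jenkins:1 1 shows the exact command before anything is created.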