Search code examples
amazon-ec2amazon-ecs

What to do when ECS-agent is disconnected?


I have an issue that from time to time one of the EC2 instances within my cluster have its ECS-agent disconnected. This silently removes the EC2 instance from the cluster (i.e. not eligible to run any services anymore) and silently drains my cluster from serving servers. I have my cluster backed with an autoscaling group, spawning servers to keep up the healthy amount. But the ECS-agent'disconnected servers are not marked as unhealthy, so the AS-group thinks everything is alright.

I have the feeling there must be something (easy) to mitigate this, or I'm having a big issue with choosing ECS and using it in production.


Solution

  • We had this issue for a long time. With each new AWS ECS-optimized AMI it got better, but as of 3 months ago it still happened from time to time. As mcheshier mentioned make sure to always use the latest AMI or at least the latest aws ecs agent

    The only way we were able to resolve it was through:

    1. Timed autoscale rotations
      1. We would try to prevent it by scaling up and down at random times
    2. Good cloudwatch alerts
      1. We happened to have our application set up as a bunch of microservices that were all queue (SQS) based. We could scale up and down based on queues. We had decent monitoring set up that let us approximate rates of queues across number of ECS containers. When we detected that the rate was off we would rotate that whole ECS instance. Ie. Say our cluster deployed 4 running containers of worker-1. We approximate that each worker does 1000 messages per 5 minutes. If our queue rate was 3000 per 5 minutes and we had 4 workers, then 1 was not working as expected. We had some scripts set up in lambda to find the faulty one and terminate the entire instance that ran that container.

    I hope this helps, I realize it's specific to our in-house application, but the advice I can give you and anyone else is to take the initiative and put as many metrics out there as you can. This will let you do some neat analytics and look for kinks in the system, this being one of them.