I deploy my application on AWS ECS using EC2 instances. It is managed through Terraform, and I have not faced an issue like this before.
Recently I made some changes to my backend that introduced a bug which could cause a timeout. Because of that, my backend crashed.
The event messages shown on the deployment tab of the service are as follows:
service backend instance i-000a000b0b0000c port 8000 is unhealthy in target-group backend-prod-backend L due to (reason Request timed out)
service backend has stopped 1 running tasks: task a000000f000f00000000ffef00a0f0af.
service backend deregistered 1 targets in target-group backend-prod-backend
(service backend, taskSet ecs-svc/0000000000000000000) has begun draining connections on 1 tasks.
service backend deregistered 1 targets in target-group backend-prod-backend
service backend has started 1 tasks: task a111111f111f111231241ffef00a0f0af.
As can be seen, the task was stopped and a new task was started. However, the new task got stuck in the PROVISIONING status and stayed pending.
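For reference, the state of the stuck task and of the container instances can be inspected through the ECS API. Below is a minimal boto3 sketch; the cluster and service names (backend-prod, backend) are placeholders for the real ones.

```python
import boto3

ecs = boto3.client("ecs")  # assumes credentials/region are already configured
CLUSTER, SERVICE = "backend-prod", "backend"  # placeholder names

# Status of the tasks the service is currently trying to run.
task_arns = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        print(task["taskArn"], task["lastStatus"], task["desiredStatus"])

# Whether the container instances are registered and their ECS agent is connected.
ci_arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
if ci_arns:
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=ci_arns
    )["containerInstances"]:
        print(ci["ec2InstanceId"], "agent connected:", ci["agentConnected"])
```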
I tried restarting the EC2 instances. I even deleted the whole ECS cluster and all the instances and re-ran Terraform so it could recreate everything, but the new task still gets stuck in the same status.
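A lighter-weight restart than recreating the infrastructure is to force a new deployment on the service, which makes ECS replace the tasks in place. Roughly (boto3 sketch; cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Ask ECS to replace the service's tasks in place, without touching
# the cluster or the EC2 instances; the scheduler stops the old tasks
# and starts fresh ones from the same task definition.
ecs.update_service(
    cluster="backend-prod",  # placeholder cluster name
    service="backend",       # placeholder service name
    forceNewDeployment=True,
)
```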
I have the same version deployed in another environment with the same configuration, and there it was able to restart.
I know this might be too specific, so I'm not hoping for an answer that solves the issue, but mostly for suggestions on how to deal with a case like this. How can I debug it? Is there a better way to make it restart?
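For completeness, the service's recent events and the stop reasons of previously stopped tasks can be pulled from the API like this (boto3 sketch; names are placeholders again). The ECS agent log at /var/log/ecs/ecs-agent.log on the instance itself is also worth a look.

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "backend-prod", "backend"  # placeholder names

# Recent scheduler events for the service (the same messages as the console shows).
service = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
for event in service["events"][:10]:
    print(event["createdAt"], event["message"])

# Stop reasons of tasks the service has already stopped.
stopped = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE, desiredStatus="STOPPED")
if stopped["taskArns"]:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=stopped["taskArns"])["tasks"]:
        print(task["taskArn"], "->", task.get("stoppedReason"))
```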
Answering my own question, as I was able to resolve this. The container instance's ECS agent had automatically upgraded to a newer version, which required more resources (memory/CPU) to run. I had to release some of the resources reserved by the ECS task so that the agent could start the task.
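For anyone running into the same thing: the mismatch can be confirmed by checking the agent version each instance registered with and comparing the instance's registered vs. remaining CPU/memory against the task's reservation. A rough boto3 sketch follows (the cluster name is a placeholder); in my case, releasing resources meant lowering the task's memory/CPU reservation in the Terraform-managed task definition so the instance had enough headroom to place it.

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "backend-prod"  # placeholder cluster name

ci_arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
for ci in ecs.describe_container_instances(
    cluster=CLUSTER, containerInstances=ci_arns
)["containerInstances"]:
    # Agent version the instance registered with.
    print(ci["ec2InstanceId"], "agent", ci["versionInfo"]["agentVersion"])
    registered = {r["name"]: r.get("integerValue") for r in ci["registeredResources"]}
    remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
    for name in ("CPU", "MEMORY"):
        # If the task's reservation exceeds the remaining value here,
        # ECS cannot place the task and it stays pending.
        print(f"  {name}: {remaining.get(name)} of {registered.get(name)} free")
```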