amazon-web-services, amazon-ecs, aws-fargate

Can't connect to Fargate task with ECS execute-command even though all permissions are set


I'm having trouble connecting to a Fargate container with ECS execute-command, and it fails with the following error:

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

I've made sure I have the right permissions and setup by using ecs-checker, and I'm connecting with the following command:

aws ecs execute-command --cluster {cluster-name} --task {task_id} --container {container name} --interactive --command "/bin/bash"
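
For reference, whether ECS Exec is enabled on a running task can also be checked directly with a describe-tasks query along these lines, using the same cluster/task placeholders as above:

aws ecs describe-tasks \
  --cluster {cluster-name} \
  --tasks {task_id} \
  --query 'tasks[0].enableExecuteCommand'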

I know this error usually appears when you don't have the necessary permissions, but as I've pointed out above I've already checked with ecs-checker.sh, and here is its output:

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.13.4 Python/3.11.4 Darwin/22.4.0 source/arm64 prompt/off)
  Session Manager Plugin | OK (1.2.463.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : eu-west-2
Cluster: cluster
Task   : 47e51750712a4e1c832dd996c878f38a
-------------------------------------------------------------
  Cluster Configuration  | Audit Logging Not Configured
  Can I ExecuteCommand?  | arn:aws:iam::290319421751:role/aws-reserved/sso.amazonaws.com/eu-west-2/AWSReservedSSO_PowerUserAccess_01a9cfdb5ba4af7f
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "WebApp"
    ----------
      Init Process Enabled (WebAppTaskDefinition:49)
    ----------
         1. Enabled - "WebApp"
    ----------
      Read-Only Root Filesystem (WebAppTaskDefinition:49)
    ----------
         1. Disabled - "WebApp"
  Task Role Permissions  | arn:aws:iam::290319421751:role/task-role
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-11122233444:
      - com.amazonaws.eu-west-2.monitoring
      - com.amazonaws.eu-west-2.ssmmessages
  Environment Variables  | (WebAppTaskDefinition:49)
       1. container "WebApp"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined

What is strange about this situation is that the service is deployed to 4 environments and it works in all of them except one. They all use the same resources, since the clusters are created from a CloudFormation template, and the same image is deployed to all 4 environments.

Any ideas on what could cause this?


Solution

  • It turned out there was a VPC endpoint set up for SSM in that environment, which was not needed in our case since the tasks already had public network access.

    Weirdly enough, when we removed the VPC endpoint the problem went away. The endpoint may not have been configured correctly (for example its security groups), so if you have a situation similar to this one I encourage you to check whether you have misconfigured VPC endpoints for SSM and remove or fix them depending on your use case, along the lines of the sketch below.
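
    A rough sketch of the kind of AWS CLI commands that help here; the security group and endpoint IDs are placeholders, and the VPC ID is the one from the checker output above. Only delete the endpoint if nothing else in the VPC depends on it:

    # List the ssmmessages interface endpoints in the suspect VPC
    aws ec2 describe-vpc-endpoints \
      --filters Name=vpc-id,Values=vpc-11122233444 \
                Name=service-name,Values=com.amazonaws.eu-west-2.ssmmessages \
      --query 'VpcEndpoints[].{Id:VpcEndpointId,Service:ServiceName,SGs:Groups[].GroupId}'

    # Check the endpoint's security groups: they must allow inbound HTTPS (443)
    # from the tasks' security group, otherwise ECS Exec traffic to the endpoint is dropped
    aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0

    # If the endpoint is genuinely unneeded (the tasks have public internet access),
    # removing it restores the default path to the public SSM endpoints
    aws ec2 delete-vpc-endpoints --vpc-endpoint-ids vpce-0123456789abcdef0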