amazon-web-services timeout amazon-iam amazon-ecs

ECS Execution Role causes log driver failure during container startup?

When we use a custom IAM role as an ECS task definition's execution role, the resulting service fails to start on our ECS instance because the CloudWatch logging driver cannot be initialized. Specifically, we see the following errors from the ECS agent in CloudWatch:

2019-10-24T21:43:10Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-west-1:REDACTED -> STOPPED, Known Sent: NONE, PullStartedAt: 2019-10-24 21:43:08.499577397 +0000 UTC m=+187.475751716, PullStoppedAt: 2019-10-24 21:43:09.69279918 +0000 UTC m=+188.668973506, ExecutionStoppedAt: 2019-10-24 21:43:10.153954812 +0000 UTC m=+189.130129126, arn:aws:ecs:us-west-1:REDACTED wordpress -> STOPPED, Reason CannotStartContainerError: Error response from daemon: failed to initialize logging driver: CredentialsEndpointError: failed to load credentials

caused by: Get http://169.254.170.2/v2/credentials/REDACTED: dial tcp 169.254.170.2:80: connect: connection refused, Known Sent: NONE] sent: false

This "connection refused" error used to be a timeout error. After reading about similar problems, I tried to debug the issue by adding the iptables entries from https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html, even though this is an Amazon ECS provisioned CoreOS EC2 instance (not a custom one).

Essentially, that link and other issues similar to mine recommended the following rules, which at least changed the error from a timeout to the connection refused error shown above:

ubuntu:~$ sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
ubuntu:~$ sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
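
To tell a missing rule apart from a missing listener, checks along the following lines may help. This is only a sketch: the function names are hypothetical, it assumes `iptables` supports `-C` (check) and that `ss` and `curl` are installed on the host, and the probed URL is the endpoint from the log above without the redacted ID.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helpers; the function names are illustrative only.

# Returns 0 if both NAT rules shown above are present (iptables -C checks
# for an existing rule without modifying the table). Needs root.
ecs_nat_rules_present() {
  iptables -t nat -C PREROUTING -p tcp -d 169.254.170.2 --dport 80 \
    -j DNAT --to-destination 127.0.0.1:51679 2>/dev/null &&
  iptables -t nat -C OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 \
    -j REDIRECT --to-ports 51679 2>/dev/null
}

# Returns 0 if some process on the host listens on TCP port 51679,
# the port the rules redirect to.
ecs_credentials_port_listening() {
  ss -ltn | awk '{print $4}' | grep -q ':51679$'
}

# Probes the endpoint and maps curl's exit status to a short label.
# Any HTTP response, even an error status, counts as reachable.
ecs_probe_credentials_endpoint() {
  local rc=0
  curl --silent --output /dev/null --max-time 3 \
    http://169.254.170.2/v2/credentials/ || rc=$?
  case "$rc" in
    0)  echo "reachable" ;;
    7)  echo "connect failed (e.g. connection refused)" ;;
    28) echo "timed out" ;;
    *)  echo "curl exit status $rc" ;;
  esac
}
```

After sourcing the file as root, `ecs_nat_rules_present && echo "rules ok"` and the other two functions can be run one by one. If the rules are present but nothing listens on port 51679 in the host's network namespace, a refused connection would be the expected symptom.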

Note that this container definition runs fine when we don't use a custom IAM execution role in the task definition. However, I am attempting to add an AWS Secrets Manager secret to the task definition, which requires a custom role that has access to the secret.
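
To confirm what the registered task definition actually contains, it can help to print its execution role and the secrets each container references. The following is a minimal sketch using the AWS CLI; the function name and the task family `my-task-family` are placeholders, not names from this setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the execution role ARN and the per-container
# secrets of a registered task definition.
# Usage: show_task_secrets_config <task-definition-family-or-arn>
show_task_secrets_config() {
  if [ -z "${1:-}" ]; then
    echo "usage: show_task_secrets_config <task-definition>" >&2
    return 2
  fi
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --query 'taskDefinition.{executionRoleArn: executionRoleArn, secrets: containerDefinitions[].secrets}' \
    --output json
}
```

For example, `show_task_secrets_config my-task-family` should show a non-empty `executionRoleArn` whenever a container lists `secrets`.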

EDIT: Here are both the role policy JSON and the cloud-config.yml for the ECS instance:

Role policy JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "secretsmanager:GetSecretValue",
        "kms:Decrypt"
      ],
      "Resource": [
        "${var.aws_mysql_secret_arn}"
      ]
    }
  ]
}

cloud-config.yml

coreos:
  units:
   - name: update-engine.service
     command: stop
   - name: amazon-ecs-agent.service
     command: start
     runtime: true
     content: |
       [Unit]
       Description=AWS ECS Agent
       Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
       Requires=docker.socket
       After=docker.socket

       [Service]
       Environment=ECS_CLUSTER=${ecs_cluster_name}
       Environment=ECS_LOGLEVEL=${ecs_log_level}
       Environment=ECS_VERSION=${ecs_agent_version}
       Restart=on-failure
       RestartSec=30
       RestartPreventExitStatus=5
       SyslogIdentifier=ecs-agent
       ExecStartPre=-/bin/mkdir -p /var/log/ecs /var/ecs-data /etc/ecs
       ExecStartPre=-/usr/bin/docker kill ecs-agent
       ExecStartPre=-/usr/bin/docker rm ecs-agent
       ExecStartPre=iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
       ExecStartPre=iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
       ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent:$${ECS_VERSION}
       ExecStart=/usr/bin/docker run --name ecs-agent \
                                     --volume=/var/run/docker.sock:/var/run/docker.sock \
                                     --volume=/var/log/ecs:/log \
                                     --volume=/var/ecs-data:/data \
                                     --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
                                     --volume=/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro \
                                     --publish=127.0.0.1:51678:51678 \
                                     --env=ECS_LOGFILE=/log/ecs-agent.log \
                                     --env=ECS_LOGLEVEL=$${ECS_LOGLEVEL} \
                                     --env=ECS_DATADIR=/data \
                                     --env=ECS_CLUSTER=$${ECS_CLUSTER} \
                                     --env=ECS_AVAILABLE_LOGGING_DRIVERS='["awslogs"]' \
                                     --env=ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true \
                                     --log-driver=awslogs \
                                     --log-opt awslogs-region=${aws_region} \
                                     --log-opt awslogs-group=${ecs_log_group_name} \
                                     amazon/amazon-ecs-agent:$${ECS_VERSION}

Solution

The solution in our case was to run the ECS agent container in host network mode (--net=host) rather than bridge mode, because the ECS agent no longer supports bridge mode. In addition, we added the iptables rules and a localnet.conf sysctl file (setting net.ipv4.conf.all.route_localnet=1) to ensure routing was set up correctly.

Here is the resulting template that wound up working for us:

#cloud-config
coreos:
  units:
   - name: iptables-restore.service
     command: start
     runtime: true
   - name: systemd-sysctl.service
     command: start
     runtime: true
   - name: update-engine.service
     command: stop
   - name: amazon-ecs-agent.service
     command: start
     runtime: true
     content: |
       [Unit]
       Description=AWS ECS Agent
       Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
       Requires=docker.socket
       After=docker.socket

       [Service]
       Environment=ECS_CLUSTER=${ecs_cluster_name}
       Environment=ECS_LOGLEVEL=${ecs_log_level}
       Environment=ECS_VERSION=latest
       Restart=on-failure
       RestartSec=30
       RestartPreventExitStatus=5
       SyslogIdentifier=ecs-agent
       ExecStartPre=-/bin/mkdir -p /var/log/ecs /var/ecs-data /etc/ecs
       ExecStartPre=-/usr/bin/touch /etc/ecs/ecs.config
       ExecStartPre=-/usr/bin/docker kill ecs-agent
       ExecStartPre=-/usr/bin/docker rm ecs-agent
       ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent:$${ECS_VERSION}
       ExecStart=/usr/bin/docker run --name ecs-agent \
                                     --env-file=/etc/ecs/ecs.config \
                                     --volume=/var/run/docker.sock:/var/run/docker.sock \
                                     --volume=/var/log/ecs:/log \
                                     --volume=/var/ecs-data:/data \
                                     --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
                                     --volume=/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro \
                                     --net=host \
                                     --env=ECS_ENABLE_TASK_IAM_ROLE=true \
                                     --env=ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true \
                                     --env=ECS_LOGFILE=/log/ecs-agent.log \
                                     --env=ECS_LOGLEVEL=$${ECS_LOGLEVEL} \
                                     --env=ECS_DATADIR=/data \
                                     --env=ECS_CLUSTER=$${ECS_CLUSTER} \
                                     --env=ECS_AVAILABLE_LOGGING_DRIVERS='["awslogs","json-file"]' \
                                     --env=ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true \
                                     --log-driver=awslogs \
                                     --log-opt awslogs-region=${aws_region} \
                                     --log-opt awslogs-group=${ecs_log_group_name} \
                                     amazon/amazon-ecs-agent:$${ECS_VERSION}
write_files:
  - path: /var/lib/iptables/rules-save
    permissions: 0644
    owner: 'root:root'
    content: |
      *nat
      -A PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 127.0.0.1:51679
      -A OUTPUT -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
      COMMIT
  - path: /etc/sysctl.d/localnet.conf
    permissions: 0644
    owner: 'root:root'
    content: |
      net.ipv4.conf.all.route_localnet=1
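
Once an instance has booted with this configuration, a quick check of the two host-level pieces of the fix (agent network mode and the route_localnet sysctl) could look like the sketch below. The function names are hypothetical, and the container name `ecs-agent` is taken from the `docker run --name` argument in the unit above.

```shell
#!/usr/bin/env bash
# Hypothetical post-boot checks; function names are illustrative only.

# Prints the Docker network mode of the agent container (expected: host).
ecs_agent_network_mode() {
  docker inspect --format '{{.HostConfig.NetworkMode}}' ecs-agent
}

# Returns 0 if the kernel may route packets to 127.0.0.0/8 addresses,
# which the DNAT rule to 127.0.0.1:51679 relies on.
ecs_route_localnet_enabled() {
  [ "$(sysctl -n net.ipv4.conf.all.route_localnet)" = "1" ]
}

# Runs both checks, prints one line per check, returns non-zero on failure.
ecs_check_host_setup() {
  local failed=0
  if [ "$(ecs_agent_network_mode)" = "host" ]; then
    echo "ok: agent container uses host networking"
  else
    echo "FAIL: agent container is not using host networking"
    failed=1
  fi
  if ecs_route_localnet_enabled; then
    echo "ok: route_localnet is enabled"
  else
    echo "FAIL: route_localnet is not enabled"
    failed=1
  fi
  return "$failed"
}
```

Sourcing the file and running `ecs_check_host_setup` on the instance prints one line per check and returns a non-zero status if either piece is missing.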