When using a custom IAM role as an ECS task definition's execution role, the resulting service fails to start on our ECS instance because Docker cannot initialize the CloudWatch logging driver. Specifically, we see the following error from the ECS agent in CloudWatch:
2019-10-24T21:43:10Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-west-1:REDACTED -> STOPPED, Known Sent: NONE, PullStartedAt: 2019-10-24 21:43:08.499577397 +0000 UTC m=+187.475751716, PullStoppedAt: 2019-10-24 21:43:09.69279918 +0000 UTC m=+188.668973506, ExecutionStoppedAt: 2019-10-24 21:43:10.153954812 +0000 UTC m=+189.130129126, arn:aws:ecs:us-west-1:REDACTED wordpress -> STOPPED, Reason CannotStartContainerError: Error response from daemon: failed to initialize logging driver: CredentialsEndpointError: failed to load credentials
caused by: Get http://169.254.170.2/v2/credentials/REDACTED: dial tcp 169.254.170.2:80: connect: connection refused, Known Sent: NONE] sent: false
This "connection refused" error used to be a timeout error. After reading about similar problems, I tried to debug it by adding the iptables entries from https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html, even though this is an Amazon ECS provisioned CoreOS EC2 instance (not a custom one).
Essentially, that link and other issues similar to mine recommended the following rules, which at least changed the error from a timeout to "connection refused":
ubuntu:~$ sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
ubuntu:~$ sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
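To confirm that both rules actually landed in the nat table, a small check like the following can help. This is only a sketch: the function name has_ecs_credential_rules is made up for illustration, and it assumes the rules are listed in the format printed by iptables -t nat -S.

```shell
#!/bin/sh
# Hypothetical helper (not part of any AWS tooling): reads `iptables -t nat -S`
# output on stdin and succeeds only if both the PREROUTING DNAT rule and the
# OUTPUT REDIRECT rule for 169.254.170.2:80 -> local port 51679 are present.
has_ecs_credential_rules() {
    rules=$(cat)
    printf '%s\n' "$rules" |
        grep -q -- '-A PREROUTING .*169\.254\.170\.2.*--to-destination 127\.0\.0\.1:51679' &&
    printf '%s\n' "$rules" |
        grep -q -- '-A OUTPUT .*169\.254\.170\.2.*--to-ports 51679'
}

# Usage on the instance (needs root):
#   sudo iptables -t nat -S | has_ecs_credential_rules && echo "rules present"
```

Note that the rules being present only shows where traffic is redirected; something still has to be listening on port 51679 for the connection to succeed.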
Note that this container definition runs fine when we don't set a custom IAM execution role in the task definition. However, I am attempting to reference an AWS Secrets Manager secret in the task definition, which requires a custom execution role that has access to the secret.
EDIT: Here are both the role policy JSON and the cloud-config.yml for the ECS instance:
Role policy JSON:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "secretsmanager:GetSecretValue",
        "kms:Decrypt"
      ],
      "Resource": [
        "${var.aws_mysql_secret_arn}"
      ]
    }
  ]
}
cloud-config.yml:
coreos:
  units:
    - name: update-engine.service
      command: stop
    - name: amazon-ecs-agent.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=AWS ECS Agent
        Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
        Requires=docker.socket
        After=docker.socket
        [Service]
        Environment=ECS_CLUSTER=${ecs_cluster_name}
        Environment=ECS_LOGLEVEL=${ecs_log_level}
        Environment=ECS_VERSION=${ecs_agent_version}
        Restart=on-failure
        RestartSec=30
        RestartPreventExitStatus=5
        SyslogIdentifier=ecs-agent
        ExecStartPre=-/bin/mkdir -p /var/log/ecs /var/ecs-data /etc/ecs
        ExecStartPre=-/usr/bin/docker kill ecs-agent
        ExecStartPre=-/usr/bin/docker rm ecs-agent
        ExecStartPre=iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
        ExecStartPre=iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
        ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent:$${ECS_VERSION}
        ExecStart=/usr/bin/docker run --name ecs-agent \
          --volume=/var/run/docker.sock:/var/run/docker.sock \
          --volume=/var/log/ecs:/log \
          --volume=/var/ecs-data:/data \
          --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
          --volume=/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro \
          --publish=127.0.0.1:51678:51678 \
          --env=ECS_LOGFILE=/log/ecs-agent.log \
          --env=ECS_LOGLEVEL=$${ECS_LOGLEVEL} \
          --env=ECS_DATADIR=/data \
          --env=ECS_CLUSTER=$${ECS_CLUSTER} \
          --env=ECS_AVAILABLE_LOGGING_DRIVERS='["awslogs"]' \
          --env=ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true \
          --log-driver=awslogs \
          --log-opt awslogs-region=${aws_region} \
          --log-opt awslogs-group=${ecs_log_group_name} \
          amazon/amazon-ecs-agent:$${ECS_VERSION}
The solution in our case was to run the ECS agent container in host network mode (--net=host) rather than bridge mode, because the ECS agent no longer supports running in bridge mode. In addition, we added the iptables rules and a localnet.conf sysctl file to ensure that routing was set up correctly.
Here is the resulting template that wound up working for us:
#cloud-config
coreos:
  units:
    - name: iptables-restore.service
      command: start
      runtime: true
    - name: systemd-sysctl.service
      command: start
      runtime: true
    - name: update-engine.service
      command: stop
    - name: amazon-ecs-agent.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=AWS ECS Agent
        Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
        Requires=docker.socket
        After=docker.socket
        [Service]
        Environment=ECS_CLUSTER=${ecs_cluster_name}
        Environment=ECS_LOGLEVEL=${ecs_log_level}
        Environment=ECS_VERSION=latest
        Restart=on-failure
        RestartSec=30
        RestartPreventExitStatus=5
        SyslogIdentifier=ecs-agent
        ExecStartPre=-/bin/mkdir -p /var/log/ecs /var/ecs-data /etc/ecs
        ExecStartPre=-/usr/bin/touch /etc/ecs/ecs.config
        ExecStartPre=-/usr/bin/docker kill ecs-agent
        ExecStartPre=-/usr/bin/docker rm ecs-agent
        ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent:${ECS_VERSION}
        ExecStart=/usr/bin/docker run --name ecs-agent \
          --env-file=/etc/ecs/ecs.config \
          --volume=/var/run/docker.sock:/var/run/docker.sock \
          --volume=/var/log/ecs:/log \
          --volume=/var/ecs-data:/data \
          --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
          --volume=/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro \
          --net=host \
          --env=ECS_ENABLE_TASK_IAM_ROLE=true \
          --env=ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true \
          --env=ECS_LOGFILE=/log/ecs-agent.log \
          --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} \
          --env=ECS_DATADIR=/data \
          --env=ECS_CLUSTER=${ECS_CLUSTER} \
          --env=ECS_AVAILABLE_LOGGING_DRIVERS='["awslogs","json-file"]' \
          --env=ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true \
          --log-driver=awslogs \
          --log-opt awslogs-region=${aws_region} \
          --log-opt awslogs-group=${ecs_log_group_name} \
          amazon/amazon-ecs-agent:${ECS_VERSION}
write_files:
  - path: /var/lib/iptables/rules-save
    permissions: 0644
    owner: 'root:root'
    content: |
      *nat
      -A PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 127.0.0.1:51679
      -A OUTPUT -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
      COMMIT
  - path: /etc/sysctl.d/localnet.conf
    permissions: 0644
    owner: 'root:root'
    content: |
      net.ipv4.conf.all.route_localnet=1
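
The route_localnet setting matters because the PREROUTING rule rewrites container traffic to 127.0.0.1, and the kernel normally drops packets addressed to loopback that arrive on a non-loopback interface. After the instance boots, the setting can be spot-checked with something like the sketch below. The function name check_route_localnet is made up for illustration, and the optional file argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if route_localnet is enabled.
# By default it reads the kernel's procfs entry for the "all" interface
# setting; pass another file path as $1 to check a saved copy instead.
check_route_localnet() {
    file=${1:-/proc/sys/net/ipv4/conf/all/route_localnet}
    [ "$(cat "$file" 2>/dev/null)" = "1" ]
}

# Usage on the instance:
#   check_route_localnet && echo "route_localnet is enabled"
```

If this check passes and the nat rules are loaded but tasks still cannot fetch credentials, the next thing I would look at is whether the agent container is really running with --net=host.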