amazon-web-services docker containers cloud amazon-ecs

Elastic Container Service with standalone Selenium image stuck in pending before failing and rolling back

To start off, I am very new to AWS. I am trying to run a standalone Selenium node as a service on ECS using Fargate. I managed to get one task up and running, but only with the httpd image from the ECR Public Gallery. I have tried the following:

First, I thought my networking might be wrong because I had created my own VPC, so I created a service with the httpd image from the public ECR gallery and confirmed that it could not pull the image from my VPC.
I repeated the above with the default VPC and it worked perfectly. I then tried pulling the Selenium image from the Docker registry using the default VPC and still got failures.
I tried looking at the logs, but I keep getting the error message 'no log group ecs exists'. This made me think that the image pull was the problem.
I then created my own public repo on ECR, and I am still unable to get the task to the 'running' status.

Any help or pointers in the right direction would be appreciated. I am 100% sure it's not a networking issue now, as I used the same VPC and security group to get httpd running.

logs from cloud formation :

Task Definition (removed sensitive information):

{
    "taskDefinitionArn": "",
    "containerDefinitions": [
        {
            "name": "selenium-node",
            "image": "",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "selenium-node-4444-tcp",
                    "containerPort": 4444,
                    "hostPort": 4444,
                    "protocol": "tcp",
                    "appProtocol": "http"
                },
                {
                    "name": "selenium-node-7900-tcp",
                    "containerPort": 7900,
                    "hostPort": 7900,
                    "protocol": "tcp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "shm-size",
                    "value": "2g"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                },
                "secretOptions": []
            }
        }
    ],
    "family": "selenium-node-task",
    "executionRoleArn": "",
    "networkMode": "awsvpc",
    "revision": 5,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "1024",
    "memory": "2048",
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "registeredAt": "2023-08-09T21:51:06.207Z",
    "registeredBy": "",
    "tags": []
}

I tried a different VPC, a different repo, an internet gateway and NAT gateway on the custom VPC, and a different image (which works). Only Selenium is causing the issue, yet it works fine locally on Docker.

Solution

Finally figured it out. If anyone else comes across this hair-pulling issue, remember not to rely on the default AmazonECSTaskExecutionRolePolicy alone. This policy does not allow the logs:CreateLogGroup action, so if the logConfiguration section of your task definition says the log group must be created ("awslogs-create-group": "true"), the task will fail without leaving any clue as to why.
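The mismatch described above can be caught before deploying. The following is a minimal sketch, not an AWS tool: the helper names (`allowed_actions`, `policy_allows`, `containers_needing_create_log_group`) are made up for illustration, and the matching is a deliberate simplification that looks only at Allow statements and ignores Deny statements, resources, and conditions. The task definition is trimmed to the fields that matter here.

```python
from fnmatch import fnmatchcase


def allowed_actions(policy):
    """Collect the action patterns from every Allow statement in a policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        patterns.extend(actions)
    return patterns


def policy_allows(policy, action):
    """Simplified check: case-insensitive match with '*' wildcards, Allow only."""
    return any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in allowed_actions(policy)
    )


def containers_needing_create_log_group(task_definition):
    """Names of containers that ask the awslogs driver to create the log group."""
    names = []
    for container in task_definition.get("containerDefinitions", []):
        log_config = container.get("logConfiguration") or {}
        options = log_config.get("options") or {}
        if (
            log_config.get("logDriver") == "awslogs"
            and options.get("awslogs-create-group") == "true"
        ):
            names.append(container["name"])
    return names


# Trimmed copy of the task definition from the question.
task_definition = {
    "containerDefinitions": [
        {
            "name": "selenium-node",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {"awslogs-create-group": "true"},
            },
        }
    ]
}

# Actions granted by the default policy quoted in this answer.
default_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        }
    ],
}

needing = containers_needing_create_log_group(task_definition)
if needing and not policy_allows(default_policy, "logs:CreateLogGroup"):
    print("Missing logs:CreateLogGroup for containers:", ", ".join(needing))
```

Running the same check against a policy that includes logs:CreateLogGroup prints nothing, which is the state you want before creating the service.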

Default Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

Custom Policy that works:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach your new custom policy to the execution role (the executionRoleArn in the task definition) that you use to launch your tasks.
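If you prefer to script this step rather than click through the console, here is a rough sketch in Python. It assumes a boto3 IAM client is passed in as `iam_client`; the function name `attach_custom_execution_policy` and any role or policy names you pass to it are hypothetical and not taken from my setup. The two client calls it relies on are `create_policy` and `attach_role_policy`.

```python
import json

# The custom policy from this answer: the default actions plus logs:CreateLogGroup.
CUSTOM_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup",
            ],
            "Resource": "*",
        }
    ],
}


def attach_custom_execution_policy(iam_client, role_name, policy_name):
    """Create the custom policy and attach it to the task execution role.

    iam_client is expected to behave like boto3.client("iam").
    Returns the ARN of the newly created policy.
    """
    response = iam_client.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(CUSTOM_POLICY),
    )
    policy_arn = response["Policy"]["Arn"]
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return policy_arn
```

With boto3 installed and credentials configured, a call would look something like `attach_custom_execution_policy(boto3.client("iam"), "<your execution role name>", "<a policy name of your choice>")`. The role name is the last part of the executionRoleArn, not the full ARN.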