docker jenkins amazon-ecs jenkins-agent

Jenkins slave running in an ECS cluster cannot start containers


I'm running Jenkins slaves in an AWS ECS cluster, configured as described in this post: Jenkins in ECS.

Normally it works well, but sometimes during peak hours the slave container starts very slowly (it can take more than 40 minutes) or cannot start at all.

I then have to terminate the ECS instance and launch a new one. When the container cannot start, I see this in the ecs-agent logs:

STOPPED, Reason CannotCreateContainerError: API error (500): devmapper: Thin Pool has 788 free data blocks which is less than minimum required 4454 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior

Here is my docker info output. Please advise how to fix this issue.

[root@ip-10-124-2-159 ec2-user]# docker info
Containers: 10
 Running: 1
 Paused: 0
 Stopped: 9
Images: 2
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file:
 Metadata file:
 Data Space Used: 8.646 GB
 Data Space Total: 23.35 GB
 Data Space Available: 14.71 GB
 Metadata Space Used: 2.351 MB
 Metadata Space Total: 25.17 MB
 Metadata Space Available: 22.81 MB
 Thin Pool Minimum Free Space: 2.335 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.39-34.54.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: ip-10-124-2-159
ID: 6HVT:TWH3:YP6T:GMZO:23TM:EUAA:F7XJ:ISII:QDE7:V2SN:XKFI:XPGZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Also, I don't know why only 4 tasks can run at the same time even though the ECS instance still has resources available. How can I increase this limit?


Solution

  • Your problem is a very common one when you start and stop containers very often, and the post you mentioned is all about that. It specifically says:

    "The Amazon EC2 Container Service Plugin can launch containers on your ECS cluster that automatically register themselves as Jenkins slaves, execute the appropriate Jenkins job on the container, and then automatically remove the container/build slave afterwards"

    The problem is that if the stopped containers are not cleaned up, you eventually run out of storage space in Docker's devicemapper thin pool (not memory), which is what the error message reports. You can check this yourself if you SSH into the instance and run the following command:

    docker ps -a
    

    If you run this command while Jenkins is having trouble, you should see an almost endless list of stopped containers. You can delete all of them by running the following command:

    docker rm $(docker ps -a -q -f status=exited)
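
    Note that `docker rm` exits with a usage error when it is given no container IDs, which matters if you run the cleanup from cron or a script. A minimal sketch of a guarded version follows; the function name `remove_exited_containers` is made up for this illustration, and it assumes the `docker` CLI is on the PATH.

    ```shell
    # Hypothetical helper: remove all exited containers, doing nothing
    # (instead of failing) when there are none.
    remove_exited_containers() {
        # -q prints only the container IDs, one per line.
        ids=$(docker ps -a -q -f status=exited)
        if [ -z "$ids" ]; then
            echo "no exited containers"
            return 0
        fi
        # Word splitting is intended here: each ID becomes one argument.
        docker rm $ids
    }

    # Example usage:
    #   remove_exited_containers
    ```

    Only containers in the exited state are selected, so running containers are left alone.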
    

    However, doing this manually every so often is not convenient, so what you really want is to include the following script in the userData parameter of your ECS instance configuration when you launch it:

    #!/bin/bash
    # Append the cleanup settings to the ECS agent configuration at boot.
    echo "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m" >> /etc/ecs/ecs.config
    echo "ECS_CLUSTER=<NAME_OF_CLUSTER>" >> /etc/ecs/ecs.config
    echo "ECS_DISABLE_IMAGE_CLEANUP=false" >> /etc/ecs/ecs.config
    echo "ECS_IMAGE_CLEANUP_INTERVAL=10m" >> /etc/ecs/ecs.config
    echo "ECS_IMAGE_MINIMUM_CLEANUP_AGE=30m" >> /etc/ecs/ecs.config
    

    This instructs the ECS agent to run its automated cleanup: it removes the containers of a stopped task 1 minute after the task stops, checks every 10 minutes (the lowest interval you can set) for images to delete, and deletes images that are at least 30 minutes old and no longer referenced by an active task definition. These variables are described in the Amazon ECS container agent configuration documentation.
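
    If the configuration step can run more than once (for example while you test it by hand), appending with `>>` leaves duplicate keys in the file. A minimal sketch of an idempotent alternative follows; the helper name `set_ecs_option` and its optional file argument are assumptions made for this illustration and are not part of the ECS agent.

    ```shell
    # Hypothetical helper: set KEY=VALUE in an ecs.config-style file,
    # replacing an existing line for KEY instead of appending a duplicate.
    set_ecs_option() {
        key=$1
        value=$2
        file=${3:-/etc/ecs/ecs.config}   # default path as used in the answer

        touch "$file"
        # Keep every line that does not start with "KEY=" (grep exits
        # non-zero when nothing is kept, hence the "|| true").
        grep -v "^${key}=" "$file" > "${file}.tmp" || true
        echo "${key}=${value}" >> "${file}.tmp"
        mv "${file}.tmp" "$file"
    }

    # Example usage (as root on the container instance):
    #   set_ecs_option ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION 1m
    #   set_ecs_option ECS_IMAGE_CLEANUP_INTERVAL 10m
    ```

    The agent reads this file at startup, so changes made on a running instance only take effect after the agent is restarted.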

    In my experience, this configuration might not be enough if you start and stop containers very quickly, so you may also want to attach a sufficiently large volume to your instance to make sure there is enough space to carry on while the agent cleans up the stopped containers.
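
    To see how close an instance is to the limit, you can compare two values from `docker info`: `Data Space Available` and `Thin Pool Minimum Free Space` (2.335 GB in the question's output, which is 10% of the 23.35 GB pool). Container creation is refused once the available space drops below that minimum. The following is a minimal sketch that reads `docker info` output on standard input; the function name `thin_pool_status` is made up for this illustration, and it assumes the decimal units (kB, MB, GB, TB) shown in the output above.

    ```shell
    # Hypothetical helper: print "ok", "low" or "unknown" depending on
    # whether the devicemapper thin pool still has its minimum free space.
    thin_pool_status() {
        awk -F': ' '
            # Convert a size such as "14.71 GB" to bytes (decimal units).
            function to_bytes(s,   parts, mult) {
                split(s, parts, " ")
                mult = 1
                if (parts[2] == "kB") mult = 1e3
                else if (parts[2] == "MB") mult = 1e6
                else if (parts[2] == "GB") mult = 1e9
                else if (parts[2] == "TB") mult = 1e12
                return parts[1] * mult
            }
            /^ *Data Space Available/         { avail = to_bytes($2) }
            /^ *Thin Pool Minimum Free Space/ { min = to_bytes($2) }
            END {
                if (avail == "" || min == "") { print "unknown"; exit }
                if (avail < min) print "low"; else print "ok"
            }
        '
    }

    # Example usage:
    #   docker info 2>/dev/null | thin_pool_status
    ```

    A result of "low" means new containers will fail with the CannotCreateContainerError from the question until space is freed or the volume is enlarged.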