Tags: amazon-web-services, docker, amazon-ecs

Docker daemon crashes on EC2 instance


So, I'm updating a CloudFormation template so that it uses the Amazon Linux 2 AMI optimized for ECS [amzn2-ami-ecs-hvm-2.0.20190709-x86_64-ebs (ami-0fac5486e4cff37f4)]. Previously I was using ami-00129b193dc81bc31, which is an Amazon Linux 1 AMI.

I initially found that simply swapping the AMI meant the EC2 instances no longer joined my ECS cluster. After a lot of digging (checking that permissions, subnet, VPC, and IAM were all right), I found that the Docker daemon was crashing. The cause was traceable to what I was doing with the user data, which I'll paste below:

        "UserData": { "Fn::Base64" : { "Fn::Join" : ["", [
          "Content-Type: multipart/mixed; boundary=\"==BOUNDARY==\"\n",
          "MIME-Version: 1.0\n",
          "--==BOUNDARY==\n",
          "Content-Type: text/cloud-boothook; charset=\"us-ascii\"\n",
          "#!/bin/bash\n",
          "# Set Docker daemon options\n",
          "cloud-init-per once docker_debug echo 'OPTIONS=\"${OPTIONS} --storage-opt dm.basesize=10G\"' >> /etc/sysconfig/docker\n",
          "--==BOUNDARY==\n",
          "Content-Type: text/x-shellscript; charset=\"us-ascii\"\n",
          "#!/bin/bash -xe\n",
          "echo \"ECS_CLUSTER=",
          { "Ref" : "NovaProductionEcsCluster"},
          "\" >> /etc/ecs/ecs.config\n",
          "sudo mkdir /efs\n",
          "sudo mkdir /efs/nova_files\n",
          "sudo useradd -u 33 www-data\n",
          "sudo chown -R www-data /efs/nova_files/\n",
          "printf \"\nfunction novarun {\n docker exec -ti \\\"\\$(docker ps -qf name=ApacheTask)\\\" \\\"\\$([ \\$# -ne 1 ] && echo \\\"bash\\\" || echo \\\"\\$1\\\")\\\" \n}\n\" >> /home/ec2-user/.bashrc\n",
          "--==BOUNDARY==--"
        ]]}}
      }

Basically, this combines this (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html) with this (https://aws.amazon.com/premiumsupport/knowledge-center/increase-default-ecs-docker-limit/) to increase the disk space available to the Docker container. This works perfectly with the old AMI, but on the new AMI, running docker ps -a gives me "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

If I run sudo journalctl -u docker, I get the error "Error starting daemon: error initializing graphdriver: overlay2: unknown option dm.basesize".

Can anyone tell me what's going wrong with this approach? Is there a way to fix this flag with ECS, or a different way to enlarge the amount of disk space available to the docker image?

Thanks


Solution

  • You can just remove that Docker configuration line altogether; your (newer) Docker daemon can use all available space in whatever partition is mounted on /var/lib/docker.

    There are several different systems (storage drivers) that Docker can use to store image and container data. Initially Docker used devicemapper – the "dm" in your option – which didn't require special Linux kernel support, but did have a fixed-size storage allocation for all Docker content. (Devicemapper was also slow and a little buggy; the better ways to use it involved giving it a dedicated disk partition, not just a file.) Most newer Docker installations use overlay2, which does require special kernel support, but now that's also fairly mainstream, and overlay2 avoids most of the problems of devicemapper.

    In short:

    1. Your newer AMI has a newer Docker and a newer kernel.
    2. Your newer Docker uses a different storage backend (overlay2).
    3. The dm.* options belong to the devicemapper backend, and overlay2 does not recognize them, hence your error message.
    4. You don't actually need special options any more, because one of the ways overlay2 is better than devicemapper is that it can natively use all of the host system disk without configuration.
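
If the same template still has to serve both the old and the new AMI, the points above could be sketched as a small shell helper that only writes the dm.basesize option when the storage driver is devicemapper. This is an illustration under assumptions, not something from the AWS docs: the function names storage_opt_for_driver and append_docker_options are hypothetical, and the driver name is passed in as an argument because the daemon is not yet running at cloud-boothook time and so cannot be asked.

```shell
#!/bin/bash
# Sketch only: both function names below are hypothetical helpers,
# not part of Docker, ECS, or cloud-init.

# Print the extra daemon option that is valid for the given storage driver.
# dm.* options are understood only by devicemapper; for anything else
# (including overlay2) nothing is printed.
storage_opt_for_driver() {
  case "$1" in
    devicemapper) printf '%s\n' '--storage-opt dm.basesize=10G' ;;
    *)            printf '' ;;
  esac
}

# Append an OPTIONS line to a sysconfig-style file, but only when an
# option actually applies to the driver.
#   $1 = storage driver name
#   $2 = target file (for example /etc/sysconfig/docker)
append_docker_options() {
  local opt
  opt="$(storage_opt_for_driver "$1")"
  if [ -n "$opt" ]; then
    # ${OPTIONS} is written literally so the sysconfig file expands it later.
    printf 'OPTIONS="${OPTIONS} %s"\n' "$opt" >> "$2"
  fi
}
```

Calling append_docker_options overlay2 /etc/sysconfig/docker leaves the file untouched, which has the same effect as deleting the line from the user data. Once the instance is up, docker info --format '{{.Driver}}' reports the storage driver actually in use, and df -h /var/lib/docker shows how much space the daemon can draw on.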