
Dockerfile: why ADD and RUN curl intermittently result in different image sizes?


I've recently been refactoring a Dockerfile and decided to try ADD instead of RUN curl to make the file cleaner. To my surprise, this resulted in quite a size difference:

$ docker images | grep test
test    curl    3aa809928665   7 minutes ago    746MB
test    add     da152355bb4d   3 minutes ago    941MB

Even more surprisingly, I tried a few Dockerfiles that do nothing except ADDing or curling things, and their sizes are identical. I also tried with and without BuildKit; the result is the same (although images built without BuildKit are slightly smaller).

Here's the actual Dockerfile:

FROM ubuntu:22.04
 
ENV AWSCLI_VERSION "2.7.31"
ENV HELM_VERSION "3.9.4"
ENV OC_VERSION "4.11.5"
ENV VAULT_VERSION "1.11.3"
ENV YQ_VERSION "4.27.5"
ENV YQ_BINARY "yq_linux_amd64"
 
ENV DEBIAN_FRONTEND "noninteractive"
 
ADD "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-${AWSCLI_VERSION}.zip" /extras/awscli.zip
ADD "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-${AWSCLI_VERSION}.zip.sig" /extras/awscli.sig
ADD "https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz" /extras/helm.tgz
ADD "https://github.com/mikefarah/yq/releases/download/v${YQ_VERSION}/${YQ_BINARY}.tar.gz" /extras/yq.tgz
ADD "https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/${OC_VERSION}/openshift-client-linux.tar.gz" /extras/oc.tgz
ADD "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" /extras/vault.zip
 
COPY aws-cli.pub /extras/aws-cli.pub
 
RUN cd /extras && \
    apt update && \
    apt install -y --no-install-recommends \
        ca-certificates \
        curl \
        gawk \
        gettext \
        git \
        gnupg2 \
        jq \
        openssh-client \
        unzip && \
    gpg --import /extras/aws-cli.pub && \
    # curl -L "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-${AWSCLI_VERSION}.zip" -o /extras/awscli.zip && \
    # curl -L "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-${AWSCLI_VERSION}.zip.sig" -o /extras/awscli.sig && \
    gpg --verify awscli.sig awscli.zip && \
    unzip -qq awscli.zip && \
    /extras/aws/install --update && \
    rm -rf /extras/aws* && \
    # curl -L "https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz" -o /extras/helm.tgz && \
    # curl -L "https://github.com/mikefarah/yq/releases/download/v${YQ_VERSION}/${YQ_BINARY}.tar.gz" -o /extras/yq.tgz && \
    # curl -L "https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/${OC_VERSION}/openshift-client-linux.tar.gz" -o /extras/oc.tgz && \
    # curl -L "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" -o /extras/vault.zip && \
    find . -type f -name '*.tgz' -exec tar -xzf {} \; && \
    find . -type f -name '*.zip' -exec unzip -qq {} \; && \
    find . -type f -perm /101 -exec mv {} /usr/local/bin/ \; && \
    mv /usr/local/bin/${YQ_BINARY} /usr/local/bin/yq && \
    find /extras/ -mindepth 1 -delete && \
    apt clean && rm -rf /var/lib/apt/lists/*
 
ENTRYPOINT []

I don't understand why this happens with this particular Dockerfile, because essentially I'm doing exactly the same things in both versions.

Any ideas?


Solution

  • This happens because files brought in by ADD stay in their own image layer even if a later instruction removes them. Consider the following Dockerfiles:

    # a
    FROM alpine:latest
    RUN apk add --no-cache curl
    
    ADD https://www.python.org/ftp/python/3.10.7/Python-3.10.7.tar.xz Python.tar.xz
    RUN rm Python.tar.xz
    
    # b
    FROM alpine:latest
    RUN apk add --no-cache curl
    
    RUN curl -o Python.tar.xz https://www.python.org/ftp/python/3.10.7/Python-3.10.7.tar.xz 
    RUN rm Python.tar.xz
    
    # c
    FROM alpine:latest
    RUN apk add --no-cache curl
    
    RUN curl -o Python.tar.xz https://www.python.org/ftp/python/3.10.7/Python-3.10.7.tar.xz && \
        rm Python.tar.xz
    

    Building each of them in the same context, I got the following results:

    REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
    <none>       <none>    cc79832a5ffa   9 seconds ago    27.3MB
    <none>       <none>    87ea16448764   13 seconds ago   7.68MB
    <none>       <none>    7f794f03b960   18 seconds ago   27.3MB
    alpine       latest    9c6f07244728   5 weeks ago      5.54MB
    

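    (Only c, which downloads and deletes the archive within a single RUN, produces the small 7.68MB image; a and b both keep the archive in an earlier layer.)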

    Once an instruction finishes, its layer is final: any file it leaves behind takes up space in the image even if a later instruction deletes it. So your single RUN command with curl is the most space-efficient variant. To improve readability, you could adopt a multi-stage build, so that all the curl/ADD and unzip/tar -x commands are isolated in a build stage and only the required binaries are copied from the build stage to the deploy stage. I'm not sure you'd gain much here, though.
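
    A minimal sketch of what such a multi-stage build might look like, reduced to the Helm download from the question. The stage name `build` is an arbitrary choice, and the `linux-amd64/helm` path assumes the usual layout of the Helm release tarball; the other tools would follow the same pattern.

    ```dockerfile
    # Build stage: the downloaded archive lives only here and never reaches the final image.
    FROM ubuntu:22.04 AS build

    ENV HELM_VERSION="3.9.4"

    # Remote URLs are downloaded by ADD but not unpacked automatically.
    ADD "https://get.helm.sh/helm-v${HELM_VERSION}-linux-amd64.tar.gz" /extras/helm.tgz

    # Assumption: the tarball unpacks to linux-amd64/helm.
    RUN tar -xzf /extras/helm.tgz -C /extras

    # Final stage: starts again from the clean base image.
    FROM ubuntu:22.04

    # Only the extracted binary is copied, so only it adds to the final image size.
    COPY --from=build /extras/linux-amd64/helm /usr/local/bin/helm

    ENTRYPOINT []
    ```

    With this layout, layers created by ADD in the build stage are discarded along with that stage, so ADD can be kept for readability without inflating the final image.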