
Where are files stored in docker daemon?


I tried to add a file with the ADD command and then deleted it, but the size of the Docker image shows that it still includes that file! If I put * in .dockerignore, ADD no longer works.

Dockerfile:

FROM ubuntu:20.04

ADD myfile /tmp

RUN rm /tmp/*

Then I built it with $ docker build -t testwf .

The first step of the build prints:

Sending build context to Docker daemon  34.21MB

The size of myfile is around 33MB.

$ docker images
REPOSITORY                       TAG       IMAGE ID       CREATED          SIZE
testwf                           latest    96543168ab34   16 minutes ago   107MB
ubuntu                           20.04     ba6acccedd29   5 weeks ago      72.8MB

I expected an image of 72.8MB, the same size as ubuntu, not 107MB, which is roughly 72.8MB plus 33MB! In other words, if I had not used the ADD command, would there be any way to access the file in the container, given that it was already copied to the Docker daemon?

Update

As HansKilian mentioned in the comments, that file went into one of the layers that the final image is built on top of. Is there any way to get rid of that layer in order to decrease the size of the final image?

$ docker history testwf:latest                                                                  
IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
2af2733972ab   4 seconds ago   /bin/sh -c rm /tmp/*                            0B
40d13da4e0cc   4 seconds ago   /bin/sh -c #(nop) ADD file:0ddf694d27b108b4a…   34.2MB
ba6acccedd29   5 weeks ago     /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      5 weeks ago     /bin/sh -c #(nop) ADD file:5d68d27cc15a80653…   72.8MB

Solution

  • There are several ways to "merge" intermediate layers in Docker:

      • multi-stage build
      • --squash
      • export/import

    More details:

    In principle, each command in a Dockerfile adds a new "layer" to the final image, containing the file system as it is after the command runs. Docker helps by storing each layer only as its diff from the previous layer, so disk space is not wasted on identical files.
    For example, if we run an add and then a remove command on top of some layer 0, the add command creates layer 1, which contains only the added file. The remove command creates layer 2, which marks the file as removed. Since each layer is only a diff against the previous one, Docker does not know during the build that the file system at layer 2 is identical to layer 0. If we repeat the add/delete commands, every add creates an extra layer whose size equals the file's. As a result, we can build multiple images with identical (final) content but different sizes. For example, we can create a 32MB file and add/delete it twice in the same image:

    FROM ubuntu:latest
    
    ADD big_file .
    RUN rm big_file
    ADD big_file .
    RUN rm big_file
    

    Building it with docker build . -t big_file:latest gives an image whose size is <BASE_SIZE> + 32 MB * 2:

    REPOSITORY                          TAG        IMAGE ID       CREATED         SIZE
    big_file                            latest     ddd32b7a8519   2 minutes ago   140MB
    ubuntu                              latest     ba6acccedd29   5 weeks ago     72.8MB
    

    We can check the layers within big_file with docker history <IMAGE> and get:

    IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
    ddd32b7a8519   4 minutes ago   /bin/sh -c rm big_file                          0B        
    c20573523c30   4 minutes ago   /bin/sh -c #(nop) ADD file:937071a2cba4a5d8b…   33.6MB    
    80ae0642e3ad   4 minutes ago   /bin/sh -c rm big_file                          0B        
    0538ebbf489c   4 minutes ago   /bin/sh -c #(nop) ADD file:937071a2cba4a5d8b…   33.6MB    
    ba6acccedd29   5 weeks ago     /bin/sh -c #(nop)  CMD ["bash"]                 0B        
    <missing>      5 weeks ago     /bin/sh -c #(nop) ADD file:5d68d27cc15a80653…   72.8MB
    

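    The arithmetic behind the 140MB figure can be checked by summing the SIZE column of the history output above. This is only a sketch: the sizes are typed in by hand in tenths of a MB (plain shell arithmetic has no decimals), and the variable names are made up for illustration.

```shell
# Per-layer sizes in tenths of a MB, copied by hand from the `docker history`
# output above (base layer, CMD, then ADD/rm twice).
layer_sizes="728 0 336 0 336 0"

total=0
for size in $layer_sizes; do
    total=$((total + size))
done

# Only the base image's files are still visible in the final file system.
visible=728
hidden=$((total - visible))

printf 'image size:             %d.%d MB\n' $((total / 10)) $((total % 10))
printf 'visible content:        %d.%d MB\n' $((visible / 10)) $((visible % 10))
printf 'hidden in lower layers: %d.%d MB\n' $((hidden / 10)) $((hidden % 10))
```

    The total comes out at 140.0 MB, of which 67.2 MB sits in layers whose content was later deleted.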
    So what do the three methods above do?

    • multi-stage build

    It throws away all layers from the previous stage and copies only the specified files to the next stage. For example:

    FROM ubuntu:latest
    
    ADD big_file .
    RUN rm big_file
    ADD big_file .
    RUN rm big_file
    ADD big_file .
    
    FROM ubuntu:latest
    COPY --from=0 big_file .
    

    Building it gives two images, one for stage-0 and another for stage-1.

    REPOSITORY                          TAG        IMAGE ID       CREATED          SIZE
    <none>                              <none>     6d844f18d92e   5 seconds ago    173MB
    big_file                            latest     4b1025db1335   33 seconds ago   106MB
    ubuntu                              latest     ba6acccedd29   5 weeks ago      72.8MB
    

    Checking the stage-1 image, it is clear that the layers of the stage-0 image are not copied. It contains only one extra layer, created by the COPY --from=0 big_file . command.

    IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
    4b1025db1335   47 seconds ago   /bin/sh -c #(nop) COPY file:937071a2cba4a5d8…   33.6MB    
    ba6acccedd29   5 weeks ago      /bin/sh -c #(nop)  CMD ["bash"]                 0B        
    <missing>      5 weeks ago      /bin/sh -c #(nop) ADD file:5d68d27cc15a80653…   72.8MB   
    

    It suits situations where you know exactly what you need from the stage-0 image. A good example is compiling in stage-0 and copying only the binary to stage-1. One common mistake, however, is forgetting to copy the dynamic libraries the binary requires. They may be missing from the stage-1 image, because the two stages are separate images, each with its own base image and layers.
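    The compile-then-copy pattern might look like the sketch below. It only writes the files; the build commands are left as comments because they need a Docker daemon and network access. The program hello.c, the stage name build, and the paths are all hypothetical. Linking statically is one assumed way around the missing-library problem, not the only one.

```shell
# Work in a scratch directory so nothing in the current project is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# hello.c is a hypothetical program used only for illustration.
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
EOF

cat > Dockerfile <<'EOF'
# Stage 0: holds the compiler and all the layers it drags in.
FROM ubuntu:20.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc6-dev
WORKDIR /src
COPY hello.c .
# -static avoids depending on shared libraries that stage 1 might lack.
RUN gcc -static -o hello hello.c

# Stage 1: starts again from the base image and receives only the binary.
FROM ubuntu:20.04
COPY --from=build /src/hello /usr/local/bin/hello
CMD ["hello"]
EOF

# With a Docker daemon available, build and inspect the result with:
#   docker build -t hello:latest .
#   docker history hello:latest
```

    For a dynamically linked binary, running ldd on it in stage 0 lists the shared libraries that would also have to exist in stage 1.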

    • --squash

    It is similar to squash in git. It loads and applies the diff of each layer to create a single new layer, and uses only that new layer in the built image.

    Building with docker build . -t big_file:latest --squash gives three images:

    REPOSITORY                          TAG        IMAGE ID       CREATED         SIZE
    big_file                            latest     6903fba8cef3   2 seconds ago   72.8MB
    <none>                              <none>     28ed65140111   3 seconds ago   140MB
    ubuntu                              latest     ba6acccedd29   5 weeks ago     72.8MB
    

    28ed65140111 is the image before the squash. Checking the layers in big_file:

    IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
    6903fba8cef3   10 seconds ago                                                   0B        merge sha256:28ed65140111012d7604df5123b9be16ab4bfc62dd799259001b5d609ceb8e18 to sha256:ba6acccedd2923aee4c2acc6a23780b14ed4b8a5fa4e14e252a23b846df9b6c1
    <missing>      11 seconds ago   /bin/sh -c rm big_file                          0B        
    <missing>      12 seconds ago   /bin/sh -c #(nop) ADD file:937071a2cba4a5d8b…   0B        
    <missing>      13 seconds ago   /bin/sh -c rm big_file                          0B        
    <missing>      14 seconds ago   /bin/sh -c #(nop) ADD file:937071a2cba4a5d8b…   0B        
    <missing>      5 weeks ago      /bin/sh -c #(nop)  CMD ["bash"]                 0B        
    <missing>      5 weeks ago      /bin/sh -c #(nop) ADD file:5d68d27cc15a80653…   72.8MB 
    

    After all diffs are applied, nothing differs from the base image, so the merge layer 6903fba8cef3 is empty. Note, however, that squash builds are currently an experimental feature.
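    Because the flag depends on experimental mode, a build script may want to check the daemon first. The sketch below assumes the classic builder used in this answer; newer Docker releases that default to BuildKit may not accept --squash at all. The function names are hypothetical.

```shell
# Succeeds (exit status 0) when the daemon reports experimental features enabled.
experimental_enabled() {
    [ "$(docker version --format '{{.Server.Experimental}}' 2>/dev/null)" = "true" ]
}

# build_maybe_squashed TAG: hypothetical wrapper that adds --squash only
# when the daemon is able to honour it.
build_maybe_squashed() {
    bms_tag=$1
    if experimental_enabled; then
        docker build . -t "$bms_tag" --squash
    else
        echo "experimental features are off; building without --squash" >&2
        docker build . -t "$bms_tag"
    fi
}

# Usage:
#   build_maybe_squashed big_file:latest
```

    Experimental mode is normally switched on by setting "experimental": true in the daemon's daemon.json and restarting the daemon.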

    • export/import

    Note that export works on a container rather than an image: it dumps only the current state of the container's file system and ignores the layer information in the image. If we dump a container running our big_file image and then re-import it with docker export <CONTAINER_ID> > big_file.tar && docker import - big_file:load < big_file.tar, we get an "empty" image that looks like:

    IMAGE          CREATED          CREATED BY   SIZE      COMMENT
    06f4d01022e7   13 seconds ago                72.8MB    Imported from -
    

    Now we cannot tell how the image was built, since the layers were not dumped.
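    The export/import round trip can be wrapped in a small helper. This is a sketch with a hypothetical function name. It pipes the tarball directly instead of writing big_file.tar, and it uses --change to put back a default command, since an imported image keeps no CMD or other metadata from the original. The CMD value is an assumption based on the ubuntu base image shown above.

```shell
# flatten_image SRC DST: hypothetical helper that re-imports the file system
# of image SRC as a single-layer image DST.
flatten_image() {
    fi_src=$1
    fi_dst=$2
    # `docker create` makes a stopped container, which is enough for export.
    fi_cid=$(docker create "$fi_src") || return 1
    # Stream the tarball straight into `docker import`; no intermediate file.
    # --change restores a default command, which import would otherwise drop.
    docker export "$fi_cid" | docker import --change 'CMD ["bash"]' - "$fi_dst"
    fi_status=$?
    # Remove the temporary container whether or not the import worked.
    docker rm "$fi_cid" >/dev/null
    return "$fi_status"
}

# Usage, assuming the big_file:latest image built earlier:
#   flatten_image big_file:latest big_file:load
```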

    Which one is better really depends on the situation, but the concept of layers in Docker is very important. Docker never forgets anything unless you drop or merge the layer somehow.