How to get git clone to play nice with Docker cache?


When I clone a repository twice, e.g.:

git clone <repo_X> --depth 1 clone1
git clone <repo_X> --depth 1 clone2

and then do a diff

diff -r clone1 clone2

This shows differences:

Binary files clone1/.git/index and clone2/.git/index differ
...
diff -r clone1/.git/logs/HEAD clone2/.git/logs/HEAD
...
diff -r clone1/.git/logs/refs/remotes/origin/HEAD 
...

It seems that, among other things, the time of cloning is recorded in some of these files.
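The observation above can be reproduced locally with a small sketch. It assumes git is installed; the repository `repo_x`, the file `file.txt`, and the `demo` identity are hypothetical stand-ins for `<repo_X>`. It suggests that the commit and the checked-out files match between clones, while bookkeeping files under `.git` may not.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical throwaway repository standing in for <repo_X>.
git init -q repo_x
echo "hello" > repo_x/file.txt
git -C repo_x add file.txt
git -C repo_x -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# file:// makes git use the regular transport, so --depth applies locally too.
git clone -q --depth 1 "file://$work/repo_x" clone1
git clone -q --depth 1 "file://$work/repo_x" clone2

# The commit and the checked-out content are the same in both clones.
head1=$(git -C clone1 rev-parse HEAD)
head2=$(git -C clone2 rev-parse HEAD)
echo "clone1 HEAD: $head1"
echo "clone2 HEAD: $head2"
cmp -s clone1/file.txt clone2/file.txt && echo "file.txt is identical"

# The index caches per-file timestamps and inode numbers, so it can differ
# byte for byte even though both clones describe the same commit.
cmp -s clone1/.git/index clone2/.git/index || echo ".git/index differs"
```

So `git rev-parse HEAD` is a stable way to check the version inside an image, even when a recursive diff of the `.git` directories reports differences.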

I want to add some repositories to a Docker image. Docker uses its cache when the files have not changed. Unfortunately, after a clone Docker always invalidates the cache because of the changed files.

  1. Is it somehow possible to have two clones of a repo result in exactly the same files? (Note: I don't want to remove the .git directory, as I want to be able to use git inside the image to check the version of the repo.)

  2. Is it possible to make Docker ignore the .git folder when it comes to caching? (Note that the .git folder must still be added to the image, so .dockerignore is not an option.)


Solution

  • You can use the --mount=type=cache feature of Docker's BuildKit. A toy example Dockerfile:

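    FROM ubuntu
    # Keep apt's package cache in a BuildKit cache mount across builds.
    RUN --mount=type=cache,target=/var/cache/apt \
        apt update && apt upgrade -y && apt install -yq git
    RUN echo A00
    # First build: clone into the cache mount. Later builds: the clone fails
    # because the directory exists (the ';' ignores that) and git pull updates it.
    # The cache mount is not part of the image, so the repo is copied out of it.
    RUN --mount=type=cache,target=/tmp/git_cache/ \
        git clone --depth=1 https://github.com/qtox/qtox/ /tmp/git_cache/qtox/; \
        cd /tmp/git_cache/qtox/ && git pull && cp -r ./ /tmp/my_qtox/
    RUN echo B00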
    

    The above Dockerfile can be built with the command:

    sudo env DOCKER_BUILDKIT=1 docker build -f Dockerfile .

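    Note the DOCKER_BUILDKIT=1 environment variable: it is needed to enable BuildKit's features in docker build. (Since Docker Engine 23.0, BuildKit is the default builder on Linux, so the variable is only required on older versions.) BuildKit's features are described in the Docker documentation.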

    I used the qTox repo in the example above because it is quite large.

    --mount=type=cache automatically creates a directory meant for caching, keeps it between builds, and mounts it at /tmp/git_cache/ (the target) inside the build container. If an earlier layer changes, e.g. echo A00 becomes echo A01, the clone step runs again but finishes almost immediately, because the repository is taken from the cache.

    Also, as you requested, using this cache keeps the cloned repository the same from build to build. git pull changes the repository only when new commits appear upstream; as long as there are no new commits, the cached repository stays as it is. Hence you'll get an identical git repo every time you run docker build again.
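    The clone-or-update behaviour described above can be tried outside Docker with a minimal sketch. It assumes git is installed; the paths, the `sync_repo` helper, and the throwaway upstream repository are hypothetical stand-ins for the cache mount, the RUN step, and the qTox repo. It uses an explicit directory check instead of relying on a failed clone.

    ```shell
    set -eu
    work=$(mktemp -d)
    upstream="$work/upstream"
    cache="$work/git_cache/repo"   # plays the role of the cache mount
    target="$work/my_repo"         # plays the role of /tmp/my_qtox/

    # Hypothetical upstream repository with a single commit.
    git init -q "$upstream"
    echo "v1" > "$upstream/file.txt"
    git -C "$upstream" add file.txt
    git -C "$upstream" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "first commit"

    sync_repo() {
        if [ -d "$cache/.git" ]; then
            # Cache hit: only fetch what is new (nothing, without new commits).
            git -C "$cache" pull -q
        else
            # Cache miss: this is the first build, so do the shallow clone.
            mkdir -p "$(dirname "$cache")"
            git clone -q --depth 1 "file://$upstream" "$cache"
        fi
        # Copy the repo out of the cache; -a also preserves timestamps.
        rm -rf "$target"
        cp -a "$cache" "$target"
    }

    sync_repo   # first "build": clones
    first=$(git -C "$target" rev-parse HEAD)
    sync_repo   # second "build": reuses the cached clone
    second=$(git -C "$target" rev-parse HEAD)
    echo "first:  $first"
    echo "second: $second"
    ```

    Both runs should report the same commit, because the second run only pulls into the cached clone and finds nothing new.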

    Docker deletes the cached directory automatically only in rare cases, e.g. if it hasn't been used for a long time or if free disk space is low.

    As you can see from the Dockerfile above, the final git repo ends up in the /tmp/my_qtox/ folder of the image. You can change this path to whatever you need for your case.

    You may also have noticed that I used the same caching mechanism when installing APT packages. This is very handy because, when the image is rebuilt, packages are not re-downloaded from the remote Ubuntu server but taken from the cached directory. It helps when Docker layers before apt install have changed or when you add new packages to the install list; in both cases apt install re-runs very quickly.
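One caveat for the APT cache: the official Debian and Ubuntu images ship an `/etc/apt/apt.conf.d/docker-clean` file that deletes downloaded packages after installation, so the cache mount can end up empty. The sketch below follows the pattern shown in Docker's Dockerfile reference; it writes a Dockerfile to a temporary directory (the directory and the choice of `git` as the package are only for illustration) and prints the build command rather than running it.

```shell
set -eu
dir=$(mktemp -d)

# Write a Dockerfile that keeps apt's downloads so the cache mounts hold data.
cat > "$dir/Dockerfile" <<'EOF'
FROM ubuntu
# The stock image removes downloaded .deb files after each install; undo that.
RUN rm -f /etc/apt/apt.conf.d/docker-clean; \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
        > /etc/apt/apt.conf.d/keep-cache
# sharing=locked makes concurrent builds wait, as apt needs exclusive access.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt update && apt install -yq git
EOF

echo "Dockerfile written to $dir/Dockerfile"
echo "Build with: DOCKER_BUILDKIT=1 docker build -f $dir/Dockerfile $dir"
```

Caching `/var/lib/apt` as well keeps the package lists between builds, which is where `apt update` spends most of its time.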