
Improving Docker build time for a pip-based Python application


We have a Python project, and the current Docker build takes 350s. Here is the current Dockerfile:

FROM python:3.9

RUN apt-get update && \
    apt-get install -y python2.7

WORKDIR /var/app
COPY . .
RUN pip install .

ENTRYPOINT ["python3", "/var/app/src/main.py"]

This took 350s on every docker build. There was obvious room for improvement here, so I changed it to this:

FROM python:3.9

RUN apt-get update && \
    apt-get install -y python2.7

WORKDIR /var/app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
RUN pip install .

ENTRYPOINT ["python3", "/var/app/src/main.py"]

This brings subsequent builds down to 1s.

I also came across the following after a bit of searching:

FROM python:3.9

RUN apt-get update && \
    apt-get install -y python2.7

WORKDIR /var/app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
COPY . .
RUN pip install .

ENTRYPOINT ["python3", "/var/app/src/main.py"]

This takes a bit longer, around 80s, but it is better than the first one.

  • What I don't understand is: what are the caveats with #2 (just to be safe)?
  • When should I use #3?

FYI, I have no experience with pip/Docker.

With #2, as a side effect of Docker caching layers, if any of my dependencies' dependency versions changes (because of using range operators), then I still won't rebuild anything. Is that what #3 is trying to solve? I.e., it will reuse the cache as much as it can, but also make sure to update anything necessary?

If not, I have no idea what's going on with #3. Is /root/.cache/pip a pip-specific directory, or can it be anything?


Solution

  • The difference between the first two is a Docker thing. Docker maintains a build cache: if the previous step was cached, and either this RUN instruction is one that was run before or this COPY instruction copies identical files, then Docker skips building that layer and uses the output from the previous time you built it.

    So in the first form, whenever any file changes, the COPY line invalidates the build cache, and you have to repeat the RUN pip install line. But in the second form, you only COPY requirements.txt . first. If that file hasn't changed, then you're still working off the build cache, and you can also skip the RUN pip install line. This is standard Docker practice; you're not missing out on anything and there aren't really any caveats.
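
    One way to see this for yourself: with BuildKit's plain progress output, steps served from the build cache are labelled CACHED, so you can touch a source file, rebuild, and watch which layers are redone. A quick sketch (the myapp tag is just a placeholder):

    # Touch a source file, then rebuild with plain progress output; the
    # apt-get and pip install layers should be reported as CACHED, while
    # COPY . . and everything after it is rebuilt.
    touch src/main.py
    DOCKER_BUILDKIT=1 docker build --progress=plain -t myapp .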


    The difference between the last two is a pip thing. Pip also maintains a cache of things it's downloaded. Without Docker, if you run pip install -r requirements.txt a second time, pip will find its local cache and avoid re-downloading files.

    In Docker, though, every image build starts from the same empty image. The RUN --mount option mounts a persistent directory to be the pip cache. This on the one hand gets you a place to store the files that can be reused, and on the other avoids storing the redundant wheel files in the final image. The mount target directory needs to be the same location as pip is expecting its cache to be.
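
    That target path isn't arbitrary: /root/.cache/pip is pip's default cache directory when running as root, which Docker builds normally do. If in doubt, pip can tell you where it looks:

    pip cache dir    # prints the active cache location, e.g. /root/.cache/pip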

    What you should find is, if you haven't changed requirements.txt, then the build again skips over the RUN pip install step, the same as in your second Dockerfile. If you have, then it will at least skip downloading many of the packages. Depending on your network speed, this can lead to subsequent rebuilds running faster.
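
    One practical caveat: RUN --mount is a BuildKit feature, so it only works when BuildKit is enabled (it is the default in recent Docker releases). The usual way to opt in explicitly is a syntax directive on the first line of the Dockerfile; on older Docker versions you may also need DOCKER_BUILDKIT=1 in the environment when running docker build.

    # syntax=docker/dockerfile:1
    FROM python:3.9
    # ... rest of the third Dockerfile, unchanged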


    ... if any of my dependencies' dependency versions changes ...

    Most package managers keep two different copies of the version list: the user-managed dependency file, and a separate lock file that lists the exact versions that are actually to be installed.

    In Python the situation is a little confusing because there are several different package managers. If I weren't using an alternate tool like Pipenv or Poetry, the setup I'd suggest uses the standard Setuptools library. List your application's dependencies in a setup.cfg file:

    [options]
    install_requires =
      some-package >=1.0,<2.0
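
    For pip install -e . to work, the project also needs a build entry point next to setup.cfg; the traditional one is a minimal setup.py stub (a pyproject.toml that declares setuptools as the build backend works too):

    # setup.py -- a minimal stub; all real configuration lives in setup.cfg
    from setuptools import setup

    setup()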
    

    On your host system, use the following to install the project into a virtual environment, then run pip freeze to generate the requirements.txt file:

    python3 -m venv ./venv
    . ./venv/bin/activate
    pip install -e .
    pip freeze > requirements.txt
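
    The generated file pins every direct and transitive dependency to an exact version. Its contents might look something like this (hypothetical package names and versions, following the setup.cfg example above):

    # generated by pip freeze; do not hand-edit
    some-package==1.4.2
    some-transitive-dependency==0.9.1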
    

    Now the requirements.txt file contains the exact version of every library you directly and indirectly depend on. Do not hand-edit this file; do not regenerate it in your Dockerfile; do check it into source control. With the pair of dependency files, requirements.txt is now functionally the lock file.

    Also see the pip documentation on Repeatable Installs. Both Pipenv and Poetry maintain their own lock files, if you choose to use one of those package managers instead.
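
    Putting the pieces together, a sketch of how the final Dockerfile could look with this workflow (still assuming the project itself installs with pip install ., as above):

    # syntax=docker/dockerfile:1
    FROM python:3.9

    WORKDIR /var/app
    # The lock file changes rarely, so this layer is usually cached, and the
    # cache mount speeds up the rebuild when it isn't.
    COPY requirements.txt .
    RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
    # Source changes only invalidate the layers from here down.
    COPY . .
    RUN pip install .

    ENTRYPOINT ["python3", "/var/app/src/main.py"]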