Tags: python, docker, pip, pytorch

How do I prevent pip from re-downloading all packages when I rebuild the Docker image after a minor change to the requirements list?


I have over 200 Python packages in requirements.txt. When I rebuild the image after modifying or adding a package in the list, Docker surprisingly re-downloads all packages, even though most packages in the list are unrelated to the change I made. This makes the build take over an hour, unnecessarily.

This problem only happens inside Docker. If I add an item and run pip install -r requirements.txt outside Docker, pip downloads only the new or changed packages instead of redoing everything from scratch.
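
For example, here is roughly what I see outside Docker (tqdm==4.62.0 just stands in for whatever package I add):

pip install -r requirements.txt     # first run: downloads everything
echo "tqdm==4.62.0" >> requirements.txt
pip install -r requirements.txt     # second run: fetches only tqdm; the rest report "Requirement already satisfied"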

Here is what my Dockerfile looks like:

ARG PYTORCH="1.9.0"
ARG CUDA="11.1"
ARG CUDNN="8"

FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
...

...
WORKDIR /usr/src/app
RUN pip install cmake
ADD ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt
ADD . /usr/src/app
...

Solution

  • The problem:

    The problem is how Docker's layer cache works.

    When you rebuild an image, Docker reuses cached layers, but when you change requirements.txt, Docker sees that it cannot use the cache for ADD ./requirements.txt /usr/src/app/requirements.txt, so that step and everything after it must re-run. (Detail: the build only has the layer cache from the previous image, and at this point in the build no packages have been installed yet, so pip starts from scratch.)

    The cache from the previous image is still used up to RUN pip install cmake, but after that Docker must re-run every step because requirements.txt changed.
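
    To make that concrete, here is your Dockerfile annotated with what happens on a rebuild after requirements.txt changes (a sketch of the cache behavior, not new instructions):

    FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
    WORKDIR /usr/src/app                                  # cached
    RUN pip install cmake                                 # cached: nothing above changed
    ADD ./requirements.txt /usr/src/app/requirements.txt  # cache miss: the file content changed
    RUN pip install -r requirements.txt                   # re-runs in a fresh layer with zero packages installed
    ADD . /usr/src/app                                    # re-runs: every step after a miss re-runs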

    My recommendation:

    Split your requirements into several files, ordered from least to most frequently modified, because once Docker detects a change it stops using the cache and re-runs everything from that step onward. That way, a change to a frequently edited file only invalidates the layers below it; see the sketch after this paragraph.
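
    A minimal sketch of that layout, assuming a hypothetical split into a stable file and a frequently changing one (requirements-base.txt and requirements-dev.txt are illustrative names):

    # Stable, rarely edited packages first: these layers stay cached
    ADD ./requirements-base.txt /usr/src/app/requirements-base.txt
    RUN pip install -r requirements-base.txt

    # Frequently edited packages last: only these layers re-run on a change
    ADD ./requirements-dev.txt /usr/src/app/requirements-dev.txt
    RUN pip install -r requirements-dev.txt

    Editing requirements-dev.txt now invalidates only the last two layers; the large base install stays cached.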