python docker debian tesseract

Packages not installed during Docker build


I'm trying to install tesseract-ocr in a Docker container based on the python:3.10 image. During the build process it looks like installation goes fine, but then I cannot find the files inside the container. If I then open up the container and install it from within the container it works.

The relevant parts of my Dockerfile look like this:

# debian based
FROM python:3.10
WORKDIR /code
RUN mkdir __logger

RUN apt-get update -y
RUN apt-get install apt-utils -y

# tesseract part, tried both apt & apt-get
RUN apt-get install tesseract-ocr -y

COPY ./requirements.txt ./
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "./app.py"]

Then I start the container with docker compose up, open a shell in it with docker exec -t -i my_container_name /bin/bash, and finally run find / -type d -name "*tesseract*", which yields no results.

If I run apt-cache search tesseract-ocr I can see it is available in the list.

If I then run apt install tesseract-ocr inside the container terminal, I can see the files being installed. If I run find again afterwards, tesseract shows up:

root@06d4e841c6d2:/code# find / -type d -name "*tess*"
/usr/share/doc/tesseract-ocr-eng
/usr/share/doc/tesseract-ocr-osd
/usr/share/doc/tesseract-ocr
/usr/share/doc/libtesseract4
/usr/share/tesseract-ocr
/usr/share/tesseract-ocr/4.00/tessdata
/usr/share/tesseract-ocr/4.00/tessdata/tessconfigs

How can I make it work so that it is installed correctly during the build phase?

Here's a snippet of the logs towards the end of the build step for RUN apt-get install tesseract-ocr -y:

#18 4.079 Preparing to unpack .../5-tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...
#18 4.086 Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
#18 4.447 Selecting previously unselected package tesseract-ocr.
#18 4.451 Preparing to unpack .../6-tesseract-ocr_4.1.1-2.1_amd64.deb ...
#18 4.463 Unpacking tesseract-ocr (4.1.1-2.1) ...
#18 4.552 Setting up libarchive13:amd64 (3.4.3-2+deb11u1) ...
#18 4.574 Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
#18 4.596 Setting up libgif7:amd64 (5.1.9-2) ...
#18 4.618 Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
#18 4.640 Setting up liblept5:amd64 (1.79.0-1.1+deb11u1) ...
#18 4.665 Setting up libtesseract4:amd64 (4.1.1-2.1) ...
#18 4.688 Setting up tesseract-ocr (4.1.1-2.1) ...
#18 4.710 Processing triggers for libc-bin (2.31-13+deb11u6) ...
#18 DONE 4.8s 

Solution

  • I'm unable to reproduce your problem. I created a Docker image from this truncated Dockerfile:

    # debian based
    FROM python:3.10
    WORKDIR /code
    RUN mkdir __logger
    
    RUN apt-get update -y
    RUN apt-get install apt-utils -y
    
    # tesseract part, tried both apt & apt-get
    RUN apt-get install tesseract-ocr -y
    

    I then built the image with docker build --tag stackoverflow:test .

    After that I started a container from it and was able to find tesseract:

    % docker run -it stackoverflow:test /bin/bash
    root@2e2e3599c939:/code# find / -type d -name "*tess*"
    /usr/share/doc/tesseract-ocr
    /usr/share/doc/libtesseract4
    /usr/share/doc/tesseract-ocr-osd
    /usr/share/doc/tesseract-ocr-eng
    /usr/share/tesseract-ocr
    /usr/share/tesseract-ocr/4.00/tessdata
    /usr/share/tesseract-ocr/4.00/tessdata/tessconfigs
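
A non-interactive check can be quicker than opening a shell and running find. The sketch below assumes the stackoverflow:test tag used above. The function name check_tesseract is hypothetical, not part of Docker. It first asks dpkg whether the package is registered in the image, then runs the binary itself.

```shell
# Hypothetical helper: check that tesseract-ocr is present in an image.
# Usage: check_tesseract stackoverflow:test
check_tesseract() {
    image="$1"
    # dpkg -s exits non-zero if the package is not installed in the image
    docker run --rm "$image" dpkg -s tesseract-ocr > /dev/null || return 1
    # Running the binary confirms it is on PATH, not just unpacked
    docker run --rm "$image" tesseract --version
}
```

Each docker run starts a fresh container from the image, so the result reflects what the build produced, not what was installed by hand in a running container.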
    

    So this problem is a bit of a stumper, but here are a few things you can try that might help:

    1. build the image directly with docker build instead of going through docker compose
    2. when building, disable layer caching by passing the --no-cache argument to docker build
    3. make sure that you are running the newest version of Docker
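
The three suggestions above can be sketched as a small shell function. This is only a sketch: the name rebuild_clean is hypothetical, and it assumes it is run from the directory that contains the Dockerfile.

```shell
# Hypothetical helper covering the three suggestions above.
# Usage: rebuild_clean stackoverflow:test   (run from the Dockerfile's directory)
rebuild_clean() {
    tag="$1"
    # 3. Print the installed Docker version so it can be compared with the latest release
    docker --version
    # 1 + 2. Build with plain docker (no compose) and ignore all cached layers
    docker build --no-cache --tag "$tag" . || return 1
    # Confirm the binary exists in the freshly built image
    docker run --rm "$tag" tesseract --version
}
```

If the plain build works and the problem only appears under Compose, running docker compose build --no-cache followed by docker compose up --force-recreate applies the same idea there. Whether a stale image or container is the actual cause here is an assumption, since the problem could not be reproduced.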