python docker tesseract python-tesseract pymupdf

PyMuPDF (fitz) inaccessible in Docker


I'm trying to run OCR inside a Docker container. Since I couldn't get it to work with Tesseract directly, I refactored my code to use PyMuPDF instead. The error I get is quite simple:

File "/code/table.py", line 35, in <module>
    import fitz
ModuleNotFoundError: No module named 'fitz'

On my local (Windows) machine I can run code that looks like this:

import fitz
pages = fitz.open(source_path)  # open document
for page in pages:
   page_data = page.get_textpage_ocr(language='eng', dpi=600, full=True)
<etc>

However, in Docker the exact same code does not work.

The relevant parts of my Dockerfile look like this:

FROM python:3.10
WORKDIR /code
COPY ./requirements.txt ./
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

# install PyMupdf
RUN pip install pymupdf

COPY . .

CMD ["python", "./run.py"]

I also have pymupdf in my requirements file, but I install it separately just in case. The image builds without any errors.

The relevant parts of docker-compose.yml:

build: .
container_name: ocr
command: python ./run.py
volumes:
  - .:/code
  - type: bind
    source: "C:/Program Files/Tesseract-OCR/tessdata"
    target: /code/tessdata

In my .env file I have a reference to the bind mount: TESS_DATA_PREFIX='/code/tessdata'

I've added TESS_DATA_PREFIX to my environment variables, although it no longer seems necessary, and the error occurs long before I even try to use OCR.


Solution

  • The issue was that Docker was not picking up my changes between builds, so the container was still running from an older build. I removed all containers and the build cache, rebuilt, and now it works.

    EDIT: Also, the correct environment variable is TESSDATA_PREFIX, not TESS_DATA_PREFIX. That was my next error; after changing .env to the correct variable name, the code works exactly as configured above.
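
Both failures described above (the missing fitz module and the misnamed environment variable) can be caught early with a small sanity check run inside the container before any OCR code. The following is only a minimal sketch: the function name check_ocr_environment is hypothetical, and it assumes that the tessdata directory should contain at least one *.traineddata file.

```python
import importlib.util
import os
from pathlib import Path


def check_ocr_environment(env=None):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper, not part of PyMuPDF or Tesseract.
    """
    env = os.environ if env is None else env
    problems = []

    # PyMuPDF is imported as "fitz"; find_spec returns None when it is missing.
    if importlib.util.find_spec("fitz") is None:
        problems.append("module 'fitz' not found (is pymupdf installed in this image?)")

    # Catch the misspelled variable name from the question.
    if "TESS_DATA_PREFIX" in env:
        problems.append("TESS_DATA_PREFIX is set, but the expected name is TESSDATA_PREFIX")

    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not Path(tessdata).is_dir():
        problems.append(f"{tessdata} is not a directory (is the bind mount in place?)")
    elif not any(Path(tessdata).glob("*.traineddata")):
        # Assumption: language data files use the .traineddata extension.
        problems.append(f"no *.traineddata files found in {tessdata}")

    return problems


if __name__ == "__main__":
    for problem in check_ocr_environment():
        print("PROBLEM:", problem)
```

Running this as the container's command once after a rebuild (instead of run.py) shows whether the image actually contains pymupdf and whether the tessdata mount is visible, which helps tell a stale build apart from a configuration mistake.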