python python-3.x docker dockerfile tesseract

TesseractNotFoundError when containerizing in Docker


Problem:

I have tesseract installed on my local machine at /usr/local/Cellar/tesseract/4.1.1/bin/tesseract. Everything worked perfectly until I containerized the app in Docker, where it fails with this error: pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH

What I've tried:

Based on the error message, this is what I've tried:

1). In the Docker Desktop app, added /usr/local under file sharing and mounted that path from the local machine into the container. Still the same error.

2). Moved the tesseract executable from where it resides into the current local working directory. Still the same error (of course it doesn't work; what was I even thinking back then?).

3). Modified the Dockerfile to install tesseract with its dependencies. Here is the Dockerfile:

FROM python:3.7-alpine
RUN apk update && apk add --no-cache tesseract-ocr
WORKDIR /app
COPY ./requirements.txt ./ 
RUN pip3 install --upgrade pip
# install dependencies 
RUN pip3 install -r requirements.txt
RUN pip3 install --upgrade PyMuPDF
# bundle app source 
COPY . /app

COPY ./ChaseOCR.py /app
COPY ./BancAmericaOCR.py /app
COPY ./WellsFargoOCR.py /app

EXPOSE 8080

CMD ["python3", "MainBankClass.py"] 

The requirements.txt file also lists pytesseract and tesseract as dependencies. I still get the same error. I have been stuck on this for the past two days and am running out of options. Two related answers I found did not work in my case either. Any help is much appreciated. Thanks in advance.
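For anyone debugging the same error, it can help to check from inside the container whether the binary is visible at all. The following is a minimal sketch, not part of the original app: the function name describe_tesseract is made up for illustration, and it relies only on the standard library to do a PATH lookup similar to what the error message describes.

```python
import os
import shutil
import subprocess


def describe_tesseract(binary="tesseract"):
    """Report where `binary` resolves on PATH and the first line of its version banner.

    `describe_tesseract` is a hypothetical helper, not part of pytesseract.
    """
    searched = os.environ.get("PATH", "").split(os.pathsep)
    path = shutil.which(binary)
    if path is None:
        # Nothing on PATH matches; this is the situation the error describes.
        return {"found": False, "path": None, "version": None, "searched": searched}

    # Merge stderr into stdout so the banner is captured whichever stream it is written to.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    version = lines[0] if lines else None
    return {"found": True, "path": path, "version": version, "searched": searched}


if __name__ == "__main__":
    print(describe_tesseract())
```

Running this with the container's Python shows whether the binary is missing from the image or whether the Python code is looking somewhere else.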

EDIT:

Thanks to Neo for the solution. I am testing it now, but it is running very slowly. I am sharing the requirements.txt file here in case there are other issues unrelated to tesseract.

requirements.txt:

numpy
pandas
opencv-python
Pillow
Image
pytesseract
tesseract
PyMuPDF
python-levenshtein
tabula-py

Local file dir:

testdockerfile
├─ .vscode
│  └─ settings.json
├─ BankofAmericaOCR.py
├─ ChaseOCR.py
├─ Dockerfile
├─ MainBankClass.py
├─ __init__.py
├─ WellsFargoOCR.py
└─ requirements.txt

EDIT 2:

For future reference, in case anyone still gets the TesseractNotFound error after installing tesseract in the Docker image: if you set pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract' to run locally, comment that line out. The hard-coded local path does not exist inside the container. After that, rebuild the image and run the new image in Docker. It should be fine.
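An alternative to commenting the line in and out is to choose the command at startup. This is only a sketch under assumptions: the environment variable name TESSERACT_CMD and the helper pick_tesseract_cmd are hypothetical names chosen for illustration. Only pytesseract.pytesseract.tesseract_cmd comes from the text above.

```python
import os
import shutil


def pick_tesseract_cmd(env=None, default="tesseract"):
    """Return the tesseract command to use, preferring an explicit override.

    TESSERACT_CMD is a hypothetical environment variable, not something
    pytesseract reads on its own.
    """
    env = os.environ if env is None else env
    override = env.get("TESSERACT_CMD")
    if override and os.path.isfile(override):
        # An explicit path that really exists on this machine wins.
        return override
    # Otherwise use whatever is on PATH. If nothing is found, return the bare
    # name so that pytesseract raises its usual TesseractNotFoundError.
    return shutil.which(default) or default


# Usage in the app (left commented out so this sketch runs without pytesseract):
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = pick_tesseract_cmd()
```

Locally the variable could point at the Homebrew path, while in the container it stays unset and the binary installed in the image is picked up from PATH.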


Solution

  • Edit 3:
    Some of the Python packages in requirements.txt have additional prerequisites. With the Dockerfile below, the entire build process completed successfully.

    The trickiest part was building OpenCV.
    Credits to https://github.com/janza/docker-python3-opencv/blob/master/Dockerfile

    .
    ├── Dockerfile
    └── requirements.txt
    

    Dockerfile:

    FROM python:3.7
    
    RUN apt-get update \
        && apt-get install -y \
            build-essential \
            cmake \
            git \
            wget \
            unzip \
            yasm \
            pkg-config \
            libswscale-dev \
            libtbb2 \
            libtbb-dev \
            libjpeg-dev \
            libpng-dev \
            libtiff-dev \
            libavformat-dev \
            libpq-dev \
        && rm -rf /var/lib/apt/lists/*
    
    RUN pip install numpy
    
    WORKDIR /
    ENV OPENCV_VERSION="4.1.1"
    RUN wget https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip \
    && unzip ${OPENCV_VERSION}.zip \
    && mkdir /opencv-${OPENCV_VERSION}/cmake_binary \
    && cd /opencv-${OPENCV_VERSION}/cmake_binary \
    && cmake -DBUILD_TIFF=ON \
      -DBUILD_opencv_java=OFF \
      -DWITH_CUDA=OFF \
      -DWITH_OPENGL=ON \
      -DWITH_OPENCL=ON \
      -DWITH_IPP=ON \
      -DWITH_TBB=ON \
      -DWITH_EIGEN=ON \
      -DWITH_V4L=ON \
      -DBUILD_TESTS=OFF \
      -DBUILD_PERF_TESTS=OFF \
      -DCMAKE_BUILD_TYPE=RELEASE \
      -DCMAKE_INSTALL_PREFIX=$(python3.7 -c "import sys; print(sys.prefix)") \
      -DPYTHON_EXECUTABLE=$(which python3.7) \
      -DPYTHON_INCLUDE_DIR=$(python3.7 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
      -DPYTHON_PACKAGES_PATH=$(python3.7 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
      .. \
    && make install \
    && rm /${OPENCV_VERSION}.zip \
    && rm -r /opencv-${OPENCV_VERSION}
    RUN ln -s \
      /usr/local/python/cv2/python-3.7/cv2.cpython-37m-x86_64-linux-gnu.so \
      /usr/local/lib/python3.7/site-packages/cv2.so
    
    # OCR tooling and runtime libraries; -y keeps every apt step non-interactive
    RUN apt-get --fix-missing update \
        && apt-get --fix-broken install -y \
        && apt-get install -y \
            poppler-utils \
            tesseract-ocr \
            libtesseract-dev \
            libleptonica-dev \
            libsm6 \
            libxext6 \
            python-opencv \
        && ldconfig
    
    COPY ./requirements.txt ./ 
    RUN pip3 install --upgrade pip
    # install dependencies 
    RUN pip3 install -r requirements.txt
    

    Build:

    docker image build -t my-awesome-py .
    

    Run:

    docker run --rm my-awesome-py tesseract
    Usage:
      tesseract --help | --help-extra | --version
      tesseract --list-langs
      tesseract imagename outputbase [options...] [configfile...]
    
    OCR options:
      -l LANG[+LANG]        Specify language(s) used for OCR.
    NOTE: These options must occur before any configfile.
    
    Single options:
      --help                Show this help message.
      --help-extra          Show extra help for advanced users.
      --version             Show version information.
      --list-langs          List available languages for tesseract engine.
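The run above confirms that the tesseract binary is in the image. A similar check for the Python side can be sketched as follows. The helper missing_modules is a made-up name, and the list holds the import names that I believe correspond to the packages in requirements.txt (for example cv2 for opencv-python and fitz for PyMuPDF), so adjust it to your own file.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names, which differ from some of the PyPI names in requirements.txt.
    wanted = ["numpy", "pandas", "cv2", "PIL", "pytesseract",
              "fitz", "Levenshtein", "tabula"]
    print("missing:", missing_modules(wanted))
```

Saved as a script inside the image and run with the container's python3, an empty list means every module was found.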