Problem:
I have tesseract installed on my local machine, and its path is /usr/local/Cellar/tesseract/4.1.1/bin/tesseract. Everything worked perfectly until I containerized the app in Docker, where it fails with this error message: pytesseract.pytesseract.TesseractNotFoundError: is not installed or it's not in your PATH
What I've tried:
Based on the error message, this is what I've tried:
1). Added /usr/local under file sharing in the Docker Desktop app and mounted that path from the local machine into the container - still getting the error message (doesn't work).
2). Moved tesseract.exe from where it resides to the current local working dir - still getting the error message (of course it doesn't work - what was I even thinking back then?).
3). Modified the Dockerfile to install tesseract with its dependencies. Here is the Dockerfile:
FROM python:3.7-alpine
RUN apk update && apk add --no-cache tesseract-ocr
WORKDIR /app
COPY ./requirements.txt ./
RUN pip3 install --upgrade pip
# install dependencies
RUN pip3 install -r requirements.txt
RUN pip3 install --upgrade PyMuPDF
# bundle app source
COPY . /app
COPY ./ChaseOCR.py /app
COPY ./BancAmericaOCR.py /app
COPY ./WellsFargoOCR.py /app
EXPOSE 8080
CMD ["python3", "MainBankClass.py"]
pytesseract and tesseract are also listed in requirements.txt - still getting the error message (doesn't work). I have been stuck on this issue for the past 2 days and am running out of options. This link and this link both don't work in my case. Any help is much appreciated. Thanks in advance.
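A quick way to narrow this down is to check, from Python inside the container, whether a tesseract binary is visible on PATH at all. This is only a minimal sketch using the standard library; the helper names find_tesseract and tesseract_version are made up for illustration and are not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def tesseract_version(path):
    """Return the first line printed by `<path> --version` (empty if none)."""
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract is not on PATH in this environment")
    else:
        print(path, "->", tesseract_version(path))
```

If this prints a path inside the container but pytesseract still raises TesseractNotFoundError, the problem is more likely in how the script points pytesseract at the binary than in the image itself.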
EDIT:
Thanks for Neo's solution. I am testing it now, but it's running very slowly, so I thought it would be better to share the requirements.txt file here in case there are other issues unrelated to tesseract.
requirements.txt:
numpy
pandas
opencv-python
Pillow
Image
pytesseract
tesseract
PyMuPDF
python-levenshtein
tabula-py
Local file dir:
testdockerfile
├─ .vscode
│ └─ settings.json
├─ BankofAmericaOCR.py
├─ ChaseOCR.py
├─ Dockerfile
├─ MainBankClass.py
├─ __init__.py
├─ WellsFargoOCR.py
└─ requirements.txt
EDIT 2:
Just for future reference, in case anyone has the same issue as I did and still gets TesseractNotFoundError after installing tesseract in Docker: comment out pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract' if you set that path to run the script locally, because that local path does not exist inside the container. After that, you also need to rebuild the image and run the new image in Docker. It should be fine.
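Instead of commenting the line out by hand every time, one option is to pick the command at runtime so the same script works locally and in the container. This is a rough sketch, not the only way to do it: the function resolve_tesseract_cmd and the TESSERACT_CMD environment variable are names I made up; only pytesseract.pytesseract.tesseract_cmd is the real pytesseract setting.

```python
import os
import shutil


def resolve_tesseract_cmd(local_path=None, env_var="TESSERACT_CMD"):
    """Pick a tesseract command that works locally and in a container.

    Order: the (hypothetical) environment variable, then whatever is on
    PATH, then the hard-coded local path if that file really exists.
    Falls back to the bare name "tesseract".
    """
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if local_path and os.path.isfile(local_path):
        return local_path
    return "tesseract"


try:
    import pytesseract

    # Only point pytesseract at a path that exists in the current environment.
    pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd(
        "/usr/local/Cellar/tesseract/4.1.1/bin/tesseract"
    )
except ImportError:
    # pytesseract is not installed here; nothing to configure.
    pass
```

With this, the Homebrew path is only used when nothing is found on PATH and the file is actually there, so the container picks up the apt-installed binary without any code change.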
EDIT 3:
Some of the Python packages in requirements.txt have other prerequisites. With the Dockerfile below, the image went successfully through the entire build process. The trickiest part was building opencv.
Credits to https://github.com/janza/docker-python3-opencv/blob/master/Dockerfile
.
├── Dockerfile
└── requirements.txt
Dockerfile:
FROM python:3.7
RUN apt-get update \
&& apt-get install -y \
build-essential \
cmake \
git \
wget \
unzip \
yasm \
pkg-config \
libswscale-dev \
libtbb2 \
libtbb-dev \
libjpeg-dev \
libpng-dev \
libtiff-dev \
libavformat-dev \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
RUN pip install numpy
WORKDIR /
ENV OPENCV_VERSION="4.1.1"
RUN wget https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip \
&& unzip ${OPENCV_VERSION}.zip \
&& mkdir /opencv-${OPENCV_VERSION}/cmake_binary \
&& cd /opencv-${OPENCV_VERSION}/cmake_binary \
&& cmake -DBUILD_TIFF=ON \
-DBUILD_opencv_java=OFF \
-DWITH_CUDA=OFF \
-DWITH_OPENGL=ON \
-DWITH_OPENCL=ON \
-DWITH_IPP=ON \
-DWITH_TBB=ON \
-DWITH_EIGEN=ON \
-DWITH_V4L=ON \
-DBUILD_TESTS=OFF \
-DBUILD_PERF_TESTS=OFF \
-DCMAKE_BUILD_TYPE=RELEASE \
-DCMAKE_INSTALL_PREFIX=$(python3.7 -c "import sys; print(sys.prefix)") \
-DPYTHON_EXECUTABLE=$(which python3.7) \
-DPYTHON_INCLUDE_DIR=$(python3.7 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
-DPYTHON_PACKAGES_PATH=$(python3.7 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
.. \
&& make install \
&& rm /${OPENCV_VERSION}.zip \
&& rm -r /opencv-${OPENCV_VERSION}
RUN ln -s \
/usr/local/python/cv2/python-3.7/cv2.cpython-37m-x86_64-linux-gnu.so \
/usr/local/lib/python3.7/site-packages/cv2.so
# install tesseract, poppler and the shared libraries opencv needs at runtime
RUN apt-get --fix-missing update \
&& apt-get --fix-broken install -y \
&& apt-get install -y \
poppler-utils \
tesseract-ocr \
libtesseract-dev \
libleptonica-dev \
libsm6 \
libxext6 \
python-opencv \
&& ldconfig
COPY ./requirements.txt ./
RUN pip3 install --upgrade pip
# install dependencies
RUN pip3 install -r requirements.txt
Build:
docker image build -t my-awesome-py .
Run:
docker run --rm my-awesome-py tesseract
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
OCR options:
-l LANG[+LANG] Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.
Single options:
--help Show this help message.
--help-extra Show extra help for advanced users.
--version Show version information.
--list-langs List available languages for tesseract engine.
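The usage output above only shows that the tesseract binary is in the image. To also check the Python side, here is a small sketch that reports which of the modules from requirements.txt cannot be found. The import names differ from the pip names for some packages (opencv-python is imported as cv2, Pillow as PIL, PyMuPDF as fitz); the REQUIRED_MODULES list and the missing_modules helper are my own illustration, so adjust them to your project.

```python
import importlib.util

# Import names, not pip package names:
# opencv-python -> cv2, Pillow -> PIL, PyMuPDF -> fitz.
REQUIRED_MODULES = ["numpy", "pandas", "cv2", "PIL", "pytesseract", "fitz"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

Saved as, say, check_imports.py (a hypothetical file name) in the current directory, it could be run against the image with something like docker run --rm -v "$PWD":/work my-awesome-py python3 /work/check_imports.py.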