pdf ocr python-3.7 tesseract python-tesseract

Issues with running OCRmyPDF: A WinError2 and an Invalid Version Number

So I am putting together a simple Python script to OCR a PDF:

from PIL import Image
from tika import parser
import argparse
import img2pdf
import ocrmypdf

def main():
    
    # Named arg_parser so it does not shadow the tika parser imported above.
    arg_parser = argparse.ArgumentParser(description="Get text from image.")
    arg_parser.add_argument("image_path", metavar="i", help="The path to the image being used.")
    args = arg_parser.parse_args()
    image_path = args.image_path
    
    pdf_from_image_file_name = convert_to_pdf(image_path)
    pdf_w_ocr_file_name = ocr_pdf()
    raw_text_from_ocr_pdf = get_text_from_pdf()
    print(raw_text_from_ocr_pdf)
    
def convert_to_pdf(image_path, new_pdf_file_name="pdf_from_image"):
    # Context managers close both files even if the conversion fails.
    with Image.open(image_path) as temp_image:
        pdf_bytes = img2pdf.convert(temp_image.filename)
    with open('./' + new_pdf_file_name + '.pdf', 'wb') as new_file:
        new_file.write(pdf_bytes)
    return new_pdf_file_name

# The default input path matches the file written by convert_to_pdf().
def ocr_pdf(pdf_file_path="./pdf_from_image.pdf", new_pdf_file_name="pdf_w_ocr.pdf"):
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
    return new_pdf_file_name

def get_text_from_pdf(pdf_file_path="./pdf_w_ocr.pdf"):
    raw_pdf = parser.from_file(pdf_file_path)
    return raw_pdf['content']
    
if __name__ == '__main__':
    main()

When the script hits import ocrmypdf, it prints a [WinError 2] The system cannot find the file specified error but continues past it. The conversion from JPG or PNG to PDF works, and the output file looks fine. However, when execution reaches ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True), I get ValueError: invalid version number '4.0.0.20181030'.

The full stack trace is:

[WinError 2] The system cannot find the file specified
Traceback (most recent call last):
  File "workshop_v1.py", line 71, in <module>
    main()
  File "workshop_v1.py", line 49, in main
    pdf_w_ocr_file_name = ocr_pdf()
  File "workshop_v1.py", line 63, in ocr_pdf
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\api.py", line 339, in ocr
    check_options(options, plugin_manager)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 271, in check_options
    _check_options(options, plugin_manager, ocr_engine_languages)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 266, in _check_options
    plugin_manager.hook.check_options(options=options)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 208, in _multicall
    return outcome.get_result()
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\builtin_plugins\tesseract_ocr.py", line 84, in check_options
    version_parser=tesseract.TesseractVersion,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\subprocess\__init__.py", line 313, in check_external_program
    if found_version and version_parser(found_version) < version_parser(need_version):
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 40, in __init__
    self.parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_exec\tesseract.py", line 72, in parse
    super().parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 137, in parse
    raise ValueError("invalid version number '%s'" % vstring)
ValueError: invalid version number '4.0.0.20181030'

I'm running this on an x64 PC with Windows 10, in a Python 3.7.10 environment via Anaconda. Relevant Python package versions (via pip freeze):

pytesseract v0.3.7
ocrmypdf 12.1.0
ghostscript v0.7

Other potentially important version information outside Python:

tesseract-ocr v4.0.0.20181030 (I've added and tried a number of environment variables for this, detailed below)
leptonica v1.76.0
ghostscript v9.54.0
qpdf 10.3.2 (this was downloaded and then the files were placed in the C:/Windows/System32 directory)

Tesseract is installed at C:\Program Files (x86)\Tesseract-OCR\, so I've tried the following environment variables (as user variables):

OCRMYPDF_TESSERACT = C:\Program Files (x86)\Tesseract-OCR\tesseract.exe
Added C:\Program Files (x86)\Tesseract-OCR to the end of Path
TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR\tessdata

Any pointers or ideas would be much appreciated!

Solution

The repository was updated in response to the issue I opened: https://github.com/jbarlow83/OCRmyPDF/issues/795.
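
For context on why the version string was rejected: the traceback ends in distutils' version.py, and the error message matches the one raised by its strict version parser, which accepts only two or three numeric components. The following is a minimal sketch of that check, assuming the pattern below mirrors the one distutils uses. The helper trim_to_three_parts is hypothetical, for illustration only, and is not the change made in OCRmyPDF.

```python
import re

# Assumption: this mirrors the pattern used by distutils.version.StrictVersion:
# two or three dot-separated numbers plus an optional a/b pre-release tag.
STRICT_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(\.(\d+))?([ab](\d+))?$", re.ASCII)


def is_strict_version(version_string):
    """Return True if the string fits the strict two- or three-part form."""
    return STRICT_VERSION_RE.match(version_string) is not None


def trim_to_three_parts(version_string):
    """Hypothetical helper: keep only the first three dot-separated parts."""
    return ".".join(version_string.split(".")[:3])


reported = "4.0.0.20181030"  # version string taken from the traceback
print(is_strict_version(reported))                       # False
print(is_strict_version(trim_to_three_parts(reported)))  # True
```

The fourth component (the build date 20181030) is what makes the strict pattern fail, which is consistent with the ValueError in the traceback.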

To install the updated version, use: pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git#egg=ocrmypdf

I still get [WinError 2] The system cannot find the file specified, but it works so I'm not going to question it at this point.
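
[WinError 2] is the Windows "file not found" error, which Python typically raises when a subprocess call cannot find an executable. Which program triggers it here is not established. One way to narrow it down is to check which external programs are visible on Path from inside the same Python environment. This is a minimal sketch, assuming the executable names in the list below; they are guesses based on the tools mentioned above and should be adjusted to the actual installation.

```python
import shutil

# Assumption: these are the executable names to look for on Windows.
# Adjust the list to match what is actually installed.
CANDIDATE_PROGRAMS = ["tesseract", "gswin64c", "qpdf"]


def locate_programs(names):
    """Map each program name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_programs(CANDIDATE_PROGRAMS).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Any entry reported as not found is a candidate for the program behind the WinError 2 message.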