
Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata


I am trying to use pytesseract on Jupyter Notebook.

  • Windows 10 x64
  • Running Jupyter Notebook (Anaconda3, Python 3.6.1) with administrator privileges
  • The working directory containing the TIFF file is on a different drive (Z:)

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12 
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful, but I am still missing something: https://github.com/madmaze/pytesseract/issues/50 https://github.com/madmaze/pytesseract/issues/64

Thank you for your time on this!


Solution

  • From your post, I see two possible issues.

    1. All the trained language data files should be stored in the directory that the Windows environment variable TESSDATA_PREFIX points to, which in your case is C:\Program Files (x86)\Tesseract-OCR\tessdata.

    2. The Tesseract trained data file for English is named eng.traineddata, so the language code is 'eng' rather than 'en', unless you renamed the file. Refer to the Tesseract Data Files page for more information.

    In addition, if Image.open() cannot locate the image file, pass the full file path (e.g. 'z:\\path\\to\\image') instead of a bare file name.

    Hope this helps.
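
The points above can be put together in a short sketch. It assumes the install path from the question and that eng.traineddata sits in the tessdata folder. The helper names (traineddata_path, build_tessdata_config, ocr_image) are made up for illustration and are not part of pytesseract. The check for the data file runs before Tesseract is called, so a wrong language code such as 'en' fails with a clear message.

```python
import os
from pathlib import Path

# Assumption: install location copied from the question; adjust to your machine.
TESSERACT_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def traineddata_path(tessdata_dir, lang):
    """Return the file Tesseract is expected to load for a language code."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def build_tessdata_config(tessdata_dir, lang):
    """Build the --tessdata-dir option, failing early if the data file is missing."""
    data_file = traineddata_path(tessdata_dir, lang)
    if not data_file.is_file():
        raise FileNotFoundError(f"Missing language data: {data_file}")
    return f'--tessdata-dir "{tessdata_dir}"'


def ocr_image(image_path, lang="eng"):
    """Run OCR on one image. Needs pytesseract, Pillow and a Tesseract install."""
    import pytesseract
    from PIL import Image

    pytesseract.pytesseract.tesseract_cmd = os.path.join(TESSERACT_DIR, "tesseract.exe")
    tessdata_dir = os.path.join(TESSERACT_DIR, "tessdata")
    config = build_tessdata_config(tessdata_dir, lang)
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang, config=config)
```

It could then be called with the full path to the image, for example ocr_image('z:\\path\\to\\image'), which uses lang='eng' by default.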