I am trying to use pytesseract on Jupyter Notebook.
When I run the following code:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config=tessdata_dir_config))
I get the following error:
TesseractError Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
12
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
123 if status:
124 errors = get_errors(error_string)
--> 125 raise TesseractError(status, errors)
126 f = open(output_file_name, 'rb')
127 try:
TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')
I found these two references helpful but I am missing something: https://github.com/madmaze/pytesseract/issues/50 https://github.com/madmaze/pytesseract/issues/64
Thank you for your time on this!
From your post, I see two possible issues.
All the trained language data should be stored in the folder that TESSDATA_PREFIX, a Windows environment variable, points to, which is C:\Program Files (x86)\Tesseract-OCR\tessdata in your case.
The Tesseract trained English data file is named eng.traineddata (i.e. the language code is 'eng', not 'en') unless you renamed it. Refer to the Tesseract Data Files page for more information.
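As a rough sketch of the two points above, the snippet below checks that the trained data file for a language exists before calling pytesseract. The helper names (traineddata_path, check_language, ocr_image) and the TESSERACT_DIR constant are made up for illustration, the install path is simply the one from the question, and the tessdata folder is passed with the --tessdata-dir option used in the question rather than through the environment variable.

```python
import os

# Assumed install location, copied from the question; adjust for your machine.
TESSERACT_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def traineddata_path(tessdata_dir, lang="eng"):
    """Build the expected data file path, e.g. <tessdata_dir>/eng.traineddata."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def check_language(tessdata_dir, lang="eng"):
    """Return True if the trained data file for `lang` exists in tessdata_dir."""
    return os.path.isfile(traineddata_path(tessdata_dir, lang))


def ocr_image(image_path, lang="eng", tesseract_dir=TESSERACT_DIR):
    """Run OCR after pointing pytesseract at the executable and its tessdata folder."""
    # Imported here so the helpers above can be used without these packages.
    from PIL import Image
    import pytesseract

    tessdata_dir = os.path.join(tesseract_dir, "tessdata")
    if not check_language(tessdata_dir, lang):
        # Fail early with the exact path that is missing.
        raise FileNotFoundError(traineddata_path(tessdata_dir, lang))

    pytesseract.pytesseract.tesseract_cmd = os.path.join(tesseract_dir, "tesseract.exe")
    config = '--tessdata-dir "{}"'.format(tessdata_dir)
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)
```

With this in place, ocr_image('Multi_page24bpp.tif', lang='eng') would be the call to try, while lang='en' should raise FileNotFoundError unless a file named en.traineddata actually exists.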
In addition, if the image file passed to Image.open() cannot be found, you may need to give its full path (e.g. 'z:\\path\\to\\image').
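One way to rule out a path problem is to resolve the file name to an absolute path and confirm the file exists before opening it. This is only a sketch; resolve_image_path is a hypothetical helper, not part of pytesseract or PIL.

```python
import os


def resolve_image_path(filename, base_dir=None):
    """Return an absolute path to filename, or raise if no such file exists."""
    # Relative names are resolved against base_dir (default: current directory).
    base_dir = base_dir or os.getcwd()
    full_path = os.path.abspath(os.path.join(base_dir, filename))
    if not os.path.isfile(full_path):
        raise FileNotFoundError("Image not found: " + full_path)
    return full_path


# Example use (requires the image to exist):
# Image.open(resolve_image_path('Multi_page24bpp.tif'))
```

In a notebook the working directory is not always the folder you expect, so the error message showing the full resolved path can make it clear where the file is being looked for.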
Hope this helps.