I am working with the Python pytesseract package, using sample code like the following:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
From my understanding, the error occurs when reading the file chi-sim.traineddata (the Simplified Chinese data). Below are the attempts I have made to solve this problem.
I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the data files are located, since when I call print(pytesseract.get_languages(config = "")) I get a long list of languages printed, including chi-sim.
I also tried the call without the lang and config arguments:
text = pytesseract.image_to_string(image)
I tried setting TESSDATA_PREFIX in multiple ways, including:
Using the config parameter as in the original code.
Adding a global environment variable in PyCharm.
Adding the following line to the code:
os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
Setting it in bash_profile from the terminal:
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
Unfortunately, none of these worked.
I suspected that chi-sim.traineddata was somehow broken, so I downloaded the trained data file directly from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata) using the "Download" button on the right, and placed the downloaded file in the tesseract-lang directory and in the original tesseract directory (where eng.traineddata is located). I tried both, but neither worked.
Is there any potential solution to this issue?
The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
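To avoid this kind of mismatch, one option is to compare the requested code against the languages Tesseract actually reports before running OCR. The following is only a rough sketch: resolve_lang and ocr are hypothetical helper names of my own, not part of pytesseract, and I am assuming that pytesseract.get_languages lists the installed codes exactly as the .traineddata files are named.

```python
def resolve_lang(requested, available):
    """Return the installed language code matching `requested`.

    Hypothetical helper: tolerates '-' being typed where the file name uses '_'.
    """
    if requested in available:
        return requested
    candidate = requested.replace("-", "_")
    if candidate in available:
        return candidate
    raise ValueError(
        f"No traineddata found for {requested!r}; installed: {sorted(available)}"
    )


def ocr(image_path, requested_lang, config=""):
    """Run OCR with a language code checked against the installed ones."""
    # Imported here so resolve_lang can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    available = pytesseract.get_languages(config=config)
    lang = resolve_lang(requested_lang, available)
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)
```

With the setup from the question, ocr("dataset/test.jpeg", "chi-sim", config=tessdata_dir_config) should then end up passing lang="chi_sim", provided chi_sim.traineddata is in that tessdata directory.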