Search code examples
pythonmacostesseractpython-tesseract

Pytesseract Failed loading language 'chi-sim'


I am working on python tesseract package with sample code like the follows:

import pytesseract
from PIL import Image

tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)

And I received the following error message:

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

From my understanding, the error occurred when reading the file chi-sim.traineddata (which stands for Simplified Chinese), as I will explain the attempts I have made to settle this problem below.

  • My developing environment is M1 macOS, and I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the source files are located, since when I call
print(pytesseract.get_languages(config = ""))

I get a long list of languages printed, including chi-sim.

  • Further, if we just use English instead of Chinese, the following code can successfully recognize the English texts in an image:
text = pytesseract.image_to_string(image)
  • I've tried to specify environment variable TESSDATA_PREFIX in multiple ways, including:
  1. Using config parameter as in the original code.

  2. Adding global environment variable in PyCharm.

  3. Adding the following line in the code

os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
  1. Adding the following line to bash_profile in terminal
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/

But unfortunately, none of these works.

  • It seems as if my file chi-sim.traineddata is, somehow, broken, so I directly downloaded the trained data file from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata), hit the "Download" button on the right, and placed the downloaded file in the tesseract-lang and original tesseract directory (where eng.traineddata is located). Yes, I've tried both, but neither works.

With respect to this issue, is there any potential solutions?


Solution

  • Code works for me on Linux if I use lang="chi_sim" with _ instead of - because file downloaded from server has name chi_sim.traineddata also with _ instead of -.


    If I rename file into chi-sim.traineddata then I can use lang="chi-sim" (with - instead of _)