
pytesseract not recognizing symbols in front of letters


I'm trying to use pytesseract to read a few blocks of text, but it doesn't recognize symbols when they appear in front of or between words. It does, however, recognize the same symbols when they appear in front of numbers.

Example:

'#test $test %test' on the image is read incorrectly as 'Htest Stest Stest'

'#500 $500 %500' on the image is read correctly as '#500 $500 %500'

Here is my code:

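    import cv2
    import pytesseract
    from PIL import Image

    # Point pytesseract at the Tesseract executable (Windows install path).
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

    # Load the image and convert it to grayscale.
    image = cv2.imread("test.png")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Binarize: pixels brighter than the threshold become white, the rest black.
    threshold = 225
    _, img_binarized = cv2.threshold(image, threshold, 255, cv2.THRESH_BINARY)
    pil_img = Image.fromarray(img_binarized)

    # Run OCR with the default language.
    msg = pytesseract.image_to_string(pil_img)
    print(msg)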

I have tried a number of different config settings in the image_to_string call but haven't found anything that works. Any help is appreciated.


Solution

  • I ended up downloading all the .traineddata files from https://tesseract-ocr.github.io/tessdoc/Data-Files.html to my Tesseract-OCR folder and looping through them using the lang parameter of image_to_string. For some reason, a few languages that share the same alphabet as English worked just fine (Italian and Croatian worked best).

    My code is the same as above, but with the language set:

    msg = pytesseract.image_to_string(pil_img, lang='ita')
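
The loop over languages described above could look roughly like the sketch below. It is not the exact code used in the answer: the helper names `ocr_with_each_language` and `languages_matching`, and the `ocr` parameter (there only so the loop can be tried with a stand-in function), are made up for illustration. It assumes the relevant .traineddata files are already installed where Tesseract can find them.

```python
def ocr_with_each_language(image, languages, ocr=None):
    """Run OCR once per language code and return {language: recognized text}.

    `ocr` defaults to pytesseract.image_to_string; passing another callable
    with the same (image, lang=...) shape is a hypothetical hook for testing.
    """
    if ocr is None:
        import pytesseract  # imported lazily so a stand-in can be used instead
        ocr = pytesseract.image_to_string

    results = {}
    for lang in languages:
        try:
            results[lang] = ocr(image, lang=lang).strip()
        except Exception as exc:
            # Broad catch on purpose: a missing .traineddata file should not
            # stop the loop. Record the failure and move on.
            results[lang] = "<failed: {}>".format(exc)
    return results


def languages_matching(results, expected):
    """Return the language codes whose output equals the expected text."""
    return [lang for lang, text in results.items() if text == expected]
```

With pil_img from the code above, something like `results = ocr_with_each_language(pil_img, ['eng', 'ita', 'hrv'])` followed by `languages_matching(results, '#test $test %test')` would show which language models read the symbols correctly. Recent pytesseract versions also provide `pytesseract.get_languages()` to list the installed language codes instead of hard-coding them.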