Python pytesseract - can't find eng.traineddata for --oem 2


I am trying to extract text from a simple image.

sample_image

When I use the default engine (oem 3), the text is extracted (poorly). I would like to use the other engines (oem 2) to see if the output can improve.

import pytesseract

#this config requests --oem 2 and raises the error shown below
config = '--tessdata-dir "C:/Program Files/Tesseract-OCR/tessdata" -l eng --oem 2 --psm 6'
text = pytesseract.image_to_string(crop, config=config)

When I try to pass the option to change the engine, I get an error saying that the language files aren't found:

pytesseract.pytesseract.TesseractError: (1, "Error: Tesseract (legacy) engine requested, but components are not present in C:/Program Files/Tesseract-OCR/tessdata/eng.traineddata!! Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")

#default-engine config (no --oem, so oem 3); it runs, but the output is poor
config = '--psm 6'
text = pytesseract.image_to_string(crop, config=config)

As you can see, I am passing in the directory of eng.traineddata explicitly but it can't find the language file.

I have two questions:

  1. How can I improve the quality of the OCR with the default-engine config (the one that runs)?
  2. Why can't the language file be found? I have eng.traineddata, eng.user-patterns, and eng.user-words in the mentioned folder, as well as some other files and folders that were installed there.

I have also made sure that my environment variables are correct (which is why the default-engine config works).

Thank you


Solution

  • When performing OCR, it is extremely important to preprocess the image before throwing it into Pytesseract. Specifically for this image, we can remove the horizontal and vertical grid lines. Here's the image after preprocessing:

    [image: the sample image with the grid lines removed]

    Result from Pytesseract OCR

    XXX XXX XXX XXX XXX XXX XXX XXX XXX XXX
    89 987 98 7 987 9 789 87 987 9
    978 9 78 978 9 789 78 987 9
    78 987 9 78 *978 97/8 %9 “78 978 9
    78 978 978 978 978 98 9
    78 978 978 978 978 978 987 978 7897
    978 9 9 78 9 89 98 978 9
    

    Code

    import cv2
    import pytesseract
    
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    
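    # Remove horizontal and vertical lines (assumes dark lines on a light background)
    image = cv2.imread('1.png')
    
    # Closing with a 1x50 (tall, thin) kernel keeps only long vertical dark lines;
    # inverting turns them into white lines on a black background
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
    vertical_lines = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, vertical_kernel)
    
    # Same idea with a 50x1 (wide, flat) kernel for the horizontal lines
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
    horizontal_lines = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, horizontal_kernel)
    
    # cv2.add saturates at 255, so adding the line mask paints the grid lines white
    line_mask = cv2.add(vertical_lines, horizontal_lines)
    result = cv2.add(line_mask, image)
    
    data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
    print(data)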
    print(data)
    
    cv2.imshow('result', result)
    cv2.waitKey()
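
On the second question: the error text says the legacy components "are not present in" eng.traineddata, so the file is being found but appears to lack the legacy model. As far as I know this happens when the installed file is an LSTM-only model (for example from tessdata_fast or tessdata_best), and the legacy engine needs the eng.traineddata from the main tessdata repository. Below is a minimal sketch for comparing engine modes without the script dying on that error. The helper names `build_config` and `ocr_with_fallback` are hypothetical, not part of pytesseract, and the OCR function is passed in as an argument so the sketch does not depend on any particular installation.

```python
def build_config(psm=6, oem=None, tessdata_dir=None):
    """Assemble a Tesseract config string from optional parts (hypothetical helper)."""
    parts = []
    if tessdata_dir is not None:
        # Quote the path, as in the question, because it may contain spaces
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    if oem is not None:
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def ocr_with_fallback(image, ocr, oem_candidates=(2, 3), errors=(Exception,),
                      **config_kwargs):
    """Try each engine mode in order and return (oem, text) for the first that works.

    `ocr` is any callable accepting (image, config=...), e.g. pytesseract.image_to_string.
    `errors` is the tuple of exception types treated as "this engine is unavailable".
    """
    last_error = None
    for oem in oem_candidates:
        config = build_config(oem=oem, **config_kwargs)
        try:
            return oem, ocr(image, config=config)
        except errors as exc:
            last_error = exc  # remember the failure and try the next engine mode
    if last_error is None:
        raise ValueError("oem_candidates is empty")
    raise last_error
```

With the code above, this might be called as `ocr_with_fallback(result, pytesseract.image_to_string, errors=(pytesseract.pytesseract.TesseractError,), psm=6)`. If the returned engine mode is 3 rather than 2, the legacy model was not available and the default engine was used instead.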