I am trying to extract text from a simple image.
With the default engine (--oem 3), the text is extracted, but poorly. I would like to try another engine mode (--oem 2) to see whether the output improves.
import pytesseract
# this is the config that gives a poor output (default engine)
config = '--psm 6'
text = pytesseract.image_to_string(crop, config=config)
When I try to pass the option to change the engine, I get an error saying that the language files can't be loaded:
pytesseract.pytesseract.TesseractError: (1, "Error: Tesseract (legacy) engine requested, but components are not present in C:/Program Files/Tesseract-OCR/tessdata/eng.traineddata!! Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
# experimental config (this is the one that raises the error above)
config = '--tessdata-dir "C:/Program Files/Tesseract-OCR/tessdata" -l eng --oem 2 --psm 6'
text = pytesseract.image_to_string(crop, config=config)
As you can see, I am passing in the directory of eng.traineddata explicitly but it can't find the language file.
I have two questions:
I have also made sure that my environment variables are correct (which is why the first config works).
Thank you
When performing OCR, it is extremely important to preprocess the image before throwing it into Pytesseract. Specifically for this image, we can remove the horizontal and vertical grid lines. Here's the image after preprocessing:
Result from Pytesseract OCR
XXX XXX XXX XXX XXX XXX XXX XXX XXX XXX
89 987 98 7 987 9 789 87 987 9
978 9 78 978 9 789 78 987 9
78 987 9 78 *978 97/8 %9 “78 978 9
78 978 978 978 978 98 9
78 978 978 978 978 978 987 978 7897
978 9 9 78 9 89 98 978 9
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
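# Remove horizontal and vertical lines
# (this approach assumes dark text and grid lines on a light background)
image = cv2.imread('1.png')
# Closing with a tall, thin kernel wipes out everything except long vertical lines;
# inverting the result gives a mask that is bright only where those lines are
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
temp1 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, vertical_kernel)
# Same idea with a wide, flat kernel for the horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
temp2 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, horizontal_kernel)
# Combine both line masks and add them to the image, which paints the lines white
temp3 = cv2.add(temp1, temp2)
result = cv2.add(temp3, image)
# OCR the cleaned image, treating it as a single uniform block of text (--psm 6)
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)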
cv2.imshow('result', result)
cv2.waitKey()
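On the engine question itself: the error text says the legacy components are not present in eng.traineddata, which suggests the file is being found but is an LSTM-only model. Modes --oem 0 (legacy only) and --oem 2 (legacy + LSTM) both need the legacy data, while --oem 1 (LSTM only) and --oem 3 (default, based on what is available) do not. As far as I know, the traineddata files from the main tessdata repository contain both engines, whereas the tessdata_fast and tessdata_best files are LSTM-only, so swapping the file is probably the fix. Below is a minimal sketch for checking which modes your install can actually load. The names build_config and try_engines are hypothetical helpers, not part of pytesseract, and the ocr parameter is only there so the loop can be exercised without Tesseract installed.

```python
def build_config(oem=None, psm=6, tessdata_dir=None):
    """Assemble a Tesseract config string (hypothetical helper, not a pytesseract API)."""
    parts = []
    if tessdata_dir is not None:
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    if oem is not None:
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def try_engines(image, ocr=None, oems=(0, 1, 2, 3), **config_kwargs):
    """Run OCR once per engine mode and collect either the text or the exception.

    `ocr` defaults to pytesseract.image_to_string; any callable with the same
    (image, config=...) shape can be passed instead (an assumption made here
    purely for testability).
    """
    if ocr is None:
        import pytesseract  # imported lazily so the helper loads without it
        ocr = pytesseract.image_to_string

    results = {}
    for oem in oems:
        config = build_config(oem=oem, **config_kwargs)
        try:
            results[oem] = ocr(image, config=config)
        except Exception as exc:  # pytesseract raises TesseractError on load failure
            results[oem] = exc
    return results
```

Calling try_engines(result) on the cleaned image from the code above returns a dict keyed by engine mode. If the entries for 0 and 2 are exceptions while 1 and 3 are strings, the installed eng.traineddata most likely has no legacy model in it.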