Tags: python, image-processing, ocr, tesseract, python-tesseract

Python - pytesseract not consistent for similar images


For example, this image returns "Sieteary ear":

[image]

This image, however, returns the correct text:

[image]

The only difference between the two images is 2 pixels in height.

I have tried applying a threshold, but it didn't seem to help.

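from PIL import Image
import pytesseract
# Point pytesseract at the Tesseract executable (Windows install location)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# `path` is the image file path; it is not defined in this snippet
image = Image.open(path)
print(pytesseract.image_to_string(image, lang='eng'))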

Solution

  • You can preprocess the image with OpenCV before passing it to Pytesseract. The idea is to resize the image to a larger width with imutils, convert it to grayscale, obtain a binary image using Otsu's threshold, then apply a slight Gaussian blur. For best detection, the text should be black on a white background. Here are the preprocessing results for the two images:

    Before -> After

    [image]

    [image]

    Pytesseract returns the same output for both images:

    BigBootyHunter2
    

    Code

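    import cv2
    import pytesseract
    import imutils

    # Path to the Tesseract executable (Windows install location)
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    # Load the image and resize it to 500 px wide, preserving the aspect ratio
    image = cv2.imread('1.jpg')
    image = imutils.resize(image, width=500)

    # Convert to grayscale, then binarize with Otsu's automatically chosen threshold
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Slight blur to soften the hard edges left by thresholding
    thresh = cv2.GaussianBlur(thresh, (3, 3), 0)

    # --psm 6: treat the image as a single uniform block of text
    data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
    print(data)

    # Show the preprocessed image; press any key to close the window
    cv2.imshow('thresh', thresh)
    cv2.waitKey()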
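
The answer notes that the text should end up black on a white background, but the code above does not check this. If your input has light text on a dark background, the thresholded result will be inverted. Below is a minimal sketch of a polarity check, not part of the original answer. The helper name `ensure_dark_text_on_light` and the sample array are hypothetical, and the check assumes that text covers less than half of the image, so the majority pixel value is treated as the background. It uses only NumPy; on a `uint8` image, `255 - binary` gives the same result as `cv2.bitwise_not(binary)`.

```python
import numpy as np


def ensure_dark_text_on_light(binary: np.ndarray) -> np.ndarray:
    """Return a 0/255 uint8 image whose majority (background) value is white.

    Assumption: text covers less than half of the image, so the more
    common pixel value is taken to be the background.
    """
    white_fraction = np.count_nonzero(binary == 255) / binary.size
    if white_fraction < 0.5:
        # Background is dark: flip 0 <-> 255.
        # Same result as cv2.bitwise_not(binary) for uint8 input.
        return 255 - binary
    return binary.copy()


# Hypothetical 4x6 binary image: a white "text" patch on a black background.
sample = np.zeros((4, 6), dtype=np.uint8)
sample[1:3, 2:4] = 255

fixed = ensure_dark_text_on_light(sample)
print(fixed)
```

In the pipeline above, such a check would probably fit best right after the `cv2.threshold` call and before the Gaussian blur, so the blur and OCR both see black text on white.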