Tags: python, tesseract, python-tesseract

Unable to OCR alphanumerical image with Tesseract


I'm trying to read some alphanumeric strings in Python with pytesseract. I pre-process the images to reduce noise and convert them to black and white, but the digits inside the string are consistently misread.

original: original image

after cleanup: image after cleanup

Extracted text: WISOMW (expected: WIS9MW)

Code used:

import cv2
import pytesseract

def convert(path):
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    # Otsu threshold with inverted polarity...
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # ...then undo the inversion
    invert = 255 - thresh
    cv2.imwrite("processed.jpg", invert)

    # Perform text extraction, treating the image as a single text line
    return pytesseract.image_to_string(invert, config="--psm 7")

I've tried different configuration options for tesseract:

  • oem: tried 1, 3
  • psm: tried different modes
  • tessedit_char_whitelist: limited to alphanumerical characters
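A minimal sketch of how the three options above combine into one config string for pytesseract. The helper name `build_config` and the uppercase-plus-digits whitelist are assumptions made for illustration; they are not the exact values used in the question.

```python
import string

def build_config(psm, oem=None, whitelist=None):
    """Assemble a Tesseract config string (hypothetical helper, not part of pytesseract)."""
    parts = []
    if oem is not None:
        parts.append(f"--oem {oem}")        # OCR engine mode
    parts.append(f"--psm {psm}")            # page segmentation mode
    if whitelist:
        # -c sets a Tesseract config variable from the command line
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)

# Assumed whitelist: uppercase letters and digits
ALNUM = string.ascii_uppercase + string.digits
config = build_config(psm=7, oem=3, whitelist=ALNUM)
print(config)
```

The resulting string would be passed as `pytesseract.image_to_string(invert, config=config)`.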

I feel I'm missing something obvious, given that it reliably reads the alphabetic characters. Any idea what it could be?


Solution

  • You were so close. Dilation grows the white regions and shrinks the black ones, which thins the strokes of the characters. The resolution is low, so the dilation kernel is kept small. If you drop _INV from your threshold step, you don't need the separate inversion afterwards.

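    import cv2
    import numpy as np
    import pytesseract

    # Load directly as grayscale, so no cvtColor step is needed
    img = cv2.imread('wis9mw.jpg', cv2.IMREAD_GRAYSCALE)

    # Light blur to suppress noise, then Otsu threshold (no _INV, so no later inversion)
    img = cv2.GaussianBlur(img, (3, 3), 0)
    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Dilate to grow white and thin the black strokes.
    # Note: a 1x1 kernel leaves the image unchanged; a larger kernel
    # such as (2, 2) would be needed for a visible effect.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)

    cv2.imwrite('processed.jpg', img)

    # --psm 6: treat the image as a single uniform block of text
    text = pytesseract.image_to_string(img, config="--psm 6")
    print(text)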
    

    gives

    WIS9MW
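    To make the effect of dilation concrete, here is a rough pure-Python sketch on a tiny binary grid. It is only an illustration of the idea, not OpenCV's implementation; the function name `dilate` and the toy image are made up for this example. It shows that a 1x1 kernel changes nothing, while a 3x3 kernel grows white and thins a black stroke.

```python
def dilate(grid, k):
    """Binary dilation with a k x k square kernel of ones (k odd), anchored at its centre."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Each output pixel is the maximum over its k x k neighbourhood,
            # clipped at the image borders.
            out[y][x] = max(
                grid[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out

# 1 = white background, 0 = black stroke (a vertical bar three pixels wide)
image = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

same = dilate(image, 1)      # 1x1 kernel: identical to the input
thinner = dilate(image, 3)   # 3x3 kernel: the bar shrinks to one pixel wide

for row in thinner:
    print(row)
```

    On a low-resolution image, a kernel that is too large can erase thin strokes entirely, which is why a small kernel is suggested above.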
    
