Tags: python, ocr, python-tesseract

How to eliminate words of a particular height from an image in Tesseract OCR?


I want to delete the letters marked with red boxes.

I am getting those letters as junk in the output, so I want to delete those words from the image to get good output. Please help me remove the letters/words of that height from the image using any image processing / Tesseract / OpenCV technique.


Solution

  • Using OpenCV, we may find the contours, compute the bounding rectangle of each contour, and fill the "short" contours with black.

    For better results:

    • Apply thresholding before calling cv2.findContours.
    • Fill the bounding rectangle of each contour with a small margin.

    Code sample:

    import cv2
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required when using Windows
    
    img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read image in grayscale format
    
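    # Apply thresholding (use `cv2.THRESH_OTSU` for automatic thresholding)
    # Note: cv2.findContours treats non-zero pixels as foreground, so this assumes light text on a dark background.
    thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]  # We need this stage because not all pixel values are exactly 0 or 255.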
    
    contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[0]  # Find contours
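    letter_h_thresh = 30  # Contours shorter than this (in pixels) are considered "junk".
    margin = 2  # Extra pixels filled around each bounding rectangle
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)  # Compute the bounding rectangle
        if h < letter_h_thresh:
            # Fill the bounding rectangle with zeros (with some margins).
            # Clamp the start at 0: a negative start index would wrap around and select the wrong region.
            img[max(y-margin, 0):y+h+margin, max(x-margin, 0):x+w+margin] = 0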
    
    # Pass preprocessed image to pytesseract
    text = pytesseract.image_to_string(img, config="--psm 6")
    print("Text found: " + text)  # Text found: SCHENGEN
    
    cv2.imwrite('img.png', img)  # Save img for testing
    

    Input image words.png (with some of your red markings removed): [image not shown]

    Output image img.png (used as input to pytesseract): [image not shown]
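
    The value 30 for letter_h_thresh is hard-coded for this image. As a rough sketch of one way to derive it instead, the threshold may be taken as a fraction of the tallest bounding rectangle. The helper pick_height_threshold, the 0.5 ratio, and the sample heights are assumptions made for illustration and are not part of the code above:

```python
def pick_height_threshold(heights, ratio=0.5):
    """Return a pixel threshold as a fraction of the tallest bounding rectangle.

    Hypothetical helper: `ratio` is a guess that needs tuning per image.
    """
    if not heights:
        raise ValueError("no contour heights given")
    return ratio * max(heights)


# Made-up bounding rectangle heights in pixels (large letters mixed with small junk)
heights = [48, 50, 47, 12, 10, 49, 11]

letter_h_thresh = pick_height_threshold(heights)
junk = [h for h in heights if h < letter_h_thresh]
print(letter_h_thresh, junk)  # 25.0 [12, 10, 11]
```

    With a real image, the heights could be collected as [cv2.boundingRect(c)[3] for c in contours]. This only works when the tall letters dominate the image; a single tall non-text blob would push the threshold too high.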


    If some letters are split into separate parts (like the dot of the letter i), we may solve it using morphological operations, or use a different approach.

    A different approach is to use pytesseract.image_to_data for splitting the image into text boxes, and to look for short contours only inside each box:

    import cv2
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required when using Windows
    
    img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read image in grayscale format
    
    thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]  # We need this stage because not all pixel values are exactly 0 or 255.
    
    letter_h_thresh = 30  # Contours shorter than this (in pixels) are considered "junk".
    d = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT, config="--psm 6")
    n_boxes = len(d['level'])
    for i in range(n_boxes):
        if d['word_num'][i] > 0:  # Process only word-level boxes
            (x0, y0, w0, h0) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            #cv2.rectangle(img, (x0, y0), (x0 + w0, y0 + h0), 128, 2)  # Draw the word box (for debugging)
            y1, x1 = max(y0-2, 0), max(x0-2, 0)  # Clamp the margins so the slice start is never negative
            roi = thresh[y1:y0+h0+2, x1:x0+w0+2]
            img_roi = img[y1:y0+h0+2, x1:x0+w0+2]  # Slice in img (a view, so writing to it modifies img)
            contours = cv2.findContours(roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[0]  # Find contours in roi
            for c in contours:
                x, y, w, h = cv2.boundingRect(c)  # Compute the bounding rectangle (relative to roi)
                if h < letter_h_thresh:
                    img_roi[y:y+h, x:x+w] = 0  # Fill the bounding rectangle with zeros (in img_roi slice)
    
    text = pytesseract.image_to_string(img, config="--psm 6")
    print("Text found: " + text)  # Text found: SCHENGEN
    
    cv2.imwrite('img.png', img)  # Save img for testing
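
    If the junk is recognized as separate short words, the word boxes themselves may be filtered by height, without looking for contours. The sketch below is only an illustration: short_word_boxes is a hypothetical helper, and the sample dictionary is hand-made in the layout returned by pytesseract.Output.DICT (the values are invented):

```python
def short_word_boxes(data, min_height):
    """Return (left, top, width, height, text) for non-empty words shorter than min_height."""
    boxes = []
    for i, text in enumerate(data['text']):
        is_word = data['word_num'][i] > 0 and text.strip() != ''
        if is_word and data['height'][i] < min_height:
            boxes.append((data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i], text))
    return boxes


# Hand-made sample in the same key layout as pytesseract.Output.DICT (invented values)
sample = {
    'word_num': [0, 1, 2],
    'left': [0, 10, 10],
    'top': [0, 5, 70],
    'width': [200, 180, 90],
    'height': [100, 50, 12],
    'text': ['', 'SCHENGEN', 'junk'],
}

boxes = short_word_boxes(sample, min_height=30)
print(boxes)  # [(10, 70, 90, 12, 'junk')]
```

    With the real image, d from pytesseract.image_to_data would be passed instead of sample, and each returned box could be blanked with img[top:top+height, left:left+width] = 0. Note that the box height reported by Tesseract covers the whole word, so this works only when short junk and tall letters end up in different word boxes.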