
How to transcribe text from the highlighted areas of an image?


How can I transcribe the text from the highlighted areas of the following image with Tesseract in Python?

Input image


Solution

  • Assuming the highlighted areas have a distinct color that isn't present in the rest of the image – like the prominent red highlighting in your example – you can use color thresholding in the HSV color space via cv2.inRange.

    To do so, set up proper lower and upper limits for hue, saturation, and value. In the given example, we're detecting red-ish colors, so, in general, we would need two sets of limits, since red-ish colors sit at the 0°/180° "wrap-around" of the hue cylinder. To get by with a single set of limits, we shift the obtained hue channel by 90° and take the result modulo 180°. Also, we have highly saturated and quite bright red-ish colors, so we might look at saturation levels above 80 % and value levels above 50 %. We get such a mask:

    Mask
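    The hue-shift trick can be illustrated in isolation. Here's a minimal sketch using plain NumPy; the array holds made-up sample hues in OpenCV's 8-bit representation (0–179, i.e. degrees divided by two):

```python
import numpy as np

# Sample hues in OpenCV's 8-bit representation (0..179):
# 0 and 178 are both red-ish, 90 is cyan-ish.
h = np.array([0, 178, 90], dtype=np.uint8)

# Shift by 90 and wrap around 180: both red-ish hues now land
# near 90, so a single range like [70, 110] covers them, while
# the cyan-ish hue moves far away from that range.
h_2 = ((h.astype(int) + 90) % 180).astype(h.dtype)
print(h_2)  # [90 88  0]
```

    The intermediate cast to int avoids uint8 overflow during the addition; casting back keeps the channel compatible with cv2.merge.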

    The last thing to do is to obtain the contours from the generated mask, get the corresponding bounding rectangles, and run pytesseract on each cropped region (grayscaled and thresholded using Otsu's method for better OCR performance). My suggestion would be to also use the --psm 6 option here, which treats each crop as a single uniform block of text.

    Here's the full code including the results:

    import cv2
    import numpy as np
    import pytesseract
    
    # Read image
    img = cv2.imread('E5PY2.jpg')
    
    # Convert to HSV color space, and split channels
    h, s, v = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    
    # Shift hue channel to detect red area using only one range
    h_2 = ((h.astype(int) + 90) % 180).astype(h.dtype)
    
    # Mask highlighted boxes using color thresholding
    lower = np.array([ 70, int(0.80 * 255), int(0.50 * 255)])
    upper = np.array([110, int(1.00 * 255), int(1.00 * 255)])
    highlighted = cv2.inRange(cv2.merge([h_2, s, v]), lower, upper)
    
    # Find contours (return signature differs between OpenCV versions); retrieve bounding rectangles
    cnts = cv2.findContours(highlighted, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    rects = [cv2.boundingRect(cnt) for cnt in cnts]
    
    # Iterate bounding boxes, and OCR
    for x, y, w, h in rects:
    
        # Grayscale, and threshold using Otsu
        work = cv2.cvtColor(img[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
        work = cv2.threshold(work, 0, 255, cv2.THRESH_OTSU)[1]
    
        # OCR with pytesseract, using --psm 6
        text = pytesseract.image_to_string(work, config='--psm 6')\
            .replace('\n', '').replace('\f', '')
        print('X: {}, Y: {}, Text: {}'.format(x, y, text))
        # X: 468, Y: 1574, Text: START MEDITATING
        # X: 332, Y: 1230, Text: Well done. By signing up, you’ve taken your first
        # X: 358, Y: 182, Text: Welcome
    

    Caveat: I use the Tesseract build provided by the Mannheim University Library (UB Mannheim); results may differ slightly with other Tesseract builds.
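    If your Tesseract executable isn't on the PATH (common with custom Windows builds), pytesseract must be pointed at it explicitly before calling image_to_string. The path below is a hypothetical example; adjust it to your actual install location:

```python
import pytesseract

# Hypothetical install path - replace with the location of your tesseract.exe
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```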

    ----------------------------------------
    System information
    ----------------------------------------
    Platform:      Windows-10-10.0.19041-SP0
    Python:        3.9.1
    PyCharm:       2021.1.1
    NumPy:         1.20.3
    OpenCV:        4.5.2
    pytesseract:   5.0.0-alpha.20201127
    ----------------------------------------