
How to transcribe text from the highlighted areas of an image?


How can I transcribe the text from the highlighted areas of the following image with Tesseract in Python?

Input image


Solution

  • Assuming the highlighted areas have a distinct color that isn't present in the rest of the image – like the prominent red highlighting in your example – you can use color thresholding in the HSV color space via cv2.inRange.

    To do so, set up proper lower and upper limits for hue, saturation, and value. In the given example, we're detecting red-ish colors, so, in general, we would need two sets of limits, since red-ish colors sit at the 0°/180° "wrap-around" of the hue cylinder. To get by with a single set of limits, we shift the obtained hue channel by 90° and take the result modulo 180°. Also, we have highly saturated and quite bright red-ish colors, so we might look at saturation levels above 80 % and value levels above 50 %. We get such a mask:

    Mask
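    The hue-shift trick can be illustrated in isolation. Here's a minimal sketch using plain NumPy; the array holds made-up sample hues in OpenCV's 8-bit representation (0–179, i.e. degrees divided by two):

```python
import numpy as np

# Sample hues in OpenCV's 8-bit representation (0..179):
# 0 and 178 are both red-ish, 90 is cyan-ish.
h = np.array([0, 178, 90], dtype=np.uint8)

# Shift by 90 and wrap around 180: both red-ish hues now land
# near 90, so a single range like [70, 110] covers them, while
# the cyan-ish hue moves far away from that range.
h_2 = ((h.astype(int) + 90) % 180).astype(h.dtype)
print(h_2)  # [90 88  0]
```

    The intermediate cast to int avoids uint8 overflow during the addition; casting back keeps the channel compatible with cv2.merge.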

    The last thing to do is to obtain the contours from the generated mask, get the corresponding bounding rectangles, and run pytesseract on each cropped region (grayscaled and thresholded using Otsu's method for better OCR performance). My suggestion would be to also use the --psm 6 option here, which treats each crop as a single uniform block of text.

    Here's the full code including the results:

    import cv2
    import numpy as np
    import pytesseract
    
    # Read image
    img = cv2.imread('E5PY2.jpg')
    
    # Convert to HSV color space, and split channels
    h, s, v = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    
    # Shift hue channel to detect red area using only one range
    h_2 = ((h.astype(int) + 90) % 180).astype(h.dtype)
    
    # Mask highlighted boxes using color thresholding
    lower = np.array([ 70, int(0.80 * 255), int(0.50 * 255)])
    upper = np.array([110, int(1.00 * 255), int(1.00 * 255)])
    highlighted = cv2.inRange(cv2.merge([h_2, s, v]), lower, upper)
    
    # Find contours (return signature differs between OpenCV versions); retrieve bounding rectangles
    cnts = cv2.findContours(highlighted, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    rects = [cv2.boundingRect(cnt) for cnt in cnts]
    
    # Iterate bounding boxes, and OCR
    for x, y, w, h in rects:
    
        # Grayscale, and threshold using Otsu
        work = cv2.cvtColor(img[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
        work = cv2.threshold(work, 0, 255, cv2.THRESH_OTSU)[1]
    
        # OCR with pytesseract, using --psm 6
        text = pytesseract.image_to_string(work, config='--psm 6')\
            .replace('\n', '').replace('\f', '')
        print('X: {}, Y: {}, Text: {}'.format(x, y, text))
        # X: 468, Y: 1574, Text: START MEDITATING
        # X: 332, Y: 1230, Text: Well done. By signing up, you’ve taken your first
        # X: 358, Y: 182, Text: Welcome
    

    Caveat: I use the Tesseract build provided by the Mannheim University Library (UB Mannheim); results may differ slightly with other Tesseract builds.
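    If your Tesseract executable isn't on the PATH (common with custom Windows builds), pytesseract must be pointed at it explicitly before calling image_to_string. The path below is a hypothetical example; adjust it to your actual install location:

```python
import pytesseract

# Hypothetical install path - replace with the location of your tesseract.exe
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```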

    ----------------------------------------
    System information
    ----------------------------------------
    Platform:      Windows-10-10.0.19041-SP0
    Python:        3.9.1
    PyCharm:       2021.1.1
    NumPy:         1.20.3
    OpenCV:        4.5.2
    pytesseract:   5.0.0-alpha.20201127
    ----------------------------------------