python-3.x opencv image-processing tesseract

How to detect the boundaries of records in an image?

I have a huge number of high-resolution JPEG images (2500 x 3500 pixels) that look roughly like this:

Each of the numbers designates a separate record, and my aim is to convert these records to text.

I am aware of various OCR solutions such as OpenCV or Tesseract, but my problem is detecting the boundary of each record, so that I can later feed each one to the OCR. How can I achieve something like this:

Solution

Since every record starts with a blue number, you can threshold on blue-ish colors in the HSV color space to mask these numbers. On that mask, apply morphological closing to merge each blue number into a solid "box". From that modified mask, find the contours and determine the top y coordinate of each one. Then extract the individual records from the original image by slicing from one y coordinate to the next (+/- a few pixels) across the full image width.

Here's some code for that:

import cv2
import numpy as np

# Read image
img = cv2.imread('CfOBO.png')

# Thresholding blue-ish colors using HSV color space
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
blue_lower = (90, 128, 64)
blue_upper = (135, 255, 192)
blue_mask = cv2.inRange(hsv, blue_lower, blue_upper)

# Morphological closing
blue_mask = cv2.morphologyEx(blue_mask, cv2.MORPH_CLOSE, np.ones((11, 11), np.uint8))

# Find contours; findContours returns 2 values in OpenCV 2.x/4.x and 3 values in 3.x
cnts = cv2.findContours(blue_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Get y coordinate from bounding rectangle for each contour
y = sorted([cv2.boundingRect(cnt)[1] for cnt in cnts])

# Manually add end of last record
y.append(img.shape[0])

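# Extract records; clamp the start so a record near the top edge does not
# produce a negative (wrapping) slice index
records = [img[max(y[i] - 5, 0):y[i + 1] - 5, ...] for i in range(len(cnts))]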

# Show records
for record in records:
    cv2.imshow('Record', record)
    cv2.waitKey(0)
cv2.destroyAllWindows()
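
The slicing step could also be factored into a small helper, which makes it easier to handle two edge cases: several contours on the same text line (for example, if closing does not fully merge a number) and a record that starts within a few pixels of the image top. This is only a sketch; the function name `record_bounds` and the parameters `pad` and `min_gap` are hypothetical and would need tuning for real scans.

```python
def record_bounds(ys, height, pad=5, min_gap=10):
    """Turn contour top edges into (top, bottom) row ranges, one per record.

    ys      -- top y coordinates of the blue-number contours (any order)
    height  -- image height in pixels (img.shape[0])
    pad     -- rows to shift each cut upwards (assumed value)
    min_gap -- edges closer than this are treated as the same line (assumed value)
    """
    tops = []
    for y in sorted(ys):
        # Skip an edge that lies on the same text line as the previous one
        if tops and y - tops[-1] < min_gap:
            continue
        tops.append(y)
    # Shift each cut up by `pad` rows, but never above the image top
    cuts = [max(t - pad, 0) for t in tops] + [height]
    return list(zip(cuts[:-1], cuts[1:]))


bounds = record_bounds([120, 123, 400, 3, 700], height=900)
print(bounds)  # [(0, 115), (115, 395), (395, 695), (695, 900)]
```

With the variables from the code above, the records could then be cut with `[img[top:bottom] for top, bottom in record_bounds(y_tops, img.shape[0])]`, where `y_tops` is the list of contour y coordinates before the image height is appended.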

There's plenty of room for optimization. For example, the last record may be followed by a large blank area, because I simply used the image bottom as its lower end. Still, the general workflow should do what's desired. (I left out the subsequent pytesseract step.)
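
One possible way to handle trailing white space is to drop near-white rows from the bottom of a record. The following is a minimal sketch using only NumPy; the helper name `trim_trailing_blank_rows` and the `white_level` and `margin` values are assumptions, and JPEG artifacts may require a lower threshold than the one shown.

```python
import numpy as np


def trim_trailing_blank_rows(record, white_level=250, margin=5):
    """Drop near-white rows at the bottom of a record, keeping a small margin."""
    # Flatten each row so color and grayscale images are handled alike
    flat = record.reshape(record.shape[0], -1)
    # A row contains "ink" if any of its values is darker than white_level
    ink_rows = np.flatnonzero((flat < white_level).any(axis=1))
    if ink_rows.size == 0:
        return record  # completely blank record: leave it unchanged
    last = min(ink_rows[-1] + 1 + margin, record.shape[0])
    return record[:last]


# Synthetic example: white 100 x 40 image with a black block in rows 10..29
demo = np.full((100, 40, 3), 255, dtype=np.uint8)
demo[10:30, 5:20] = 0
trimmed = trim_trailing_blank_rows(demo)
print(trimmed.shape)  # (35, 40, 3)
```

Applied to the code above, this would only be needed for `records[-1]`, since all other records end where the next blue number begins.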

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
NumPy:         1.20.1
OpenCV:        4.5.1
----------------------------------------