opencv image-processing ocr tesseract python-tesseract

Reading text from a scanned image

I am trying to read text from a scanned image using the pytesseract library. Here is

And this is how I'm trying to read it.


import matplotlib.pyplot as plt
import pytesseract
import cv2

# load the original image
image = cv2.imread("Figure_1.png")
plt.figure(figsize=(10,10))
plt.imshow(image)
# convert the image to black and white for better OCR
ret,thresh1 = cv2.threshold(image,120,255,cv2.THRESH_BINARY)

# pytesseract image to string to get results
text = str(pytesseract.image_to_string(image, config='--psm 6'))
print (text)

The output of my script populates is complete garbage. I'm trying to read the column called "Objekt". What am I missing?

Objekt
|
pe
eo
ee
eee

Solution

You may need some image preprocessing to be done and call tesseract with correct psm option. Results after image preprocessing and psm=3:

Objekt
V1.1
V3.1
V5.1
V6.1
Sp.2
$.l6s1
9005
226
216
316
136
126
9015
1116
Tp(231221/A)
Tp(231301/0)
Tp(A/1)
Tp(1/3)
Tp(1/5)
Tp(3/011)
Tp(3/6)
To(D/6)
Tp(5/021)
Tp(5)
Tp(011/01)
Tp(6/0011)
Tp(021/02)
V2.1
V2.2
V2.3
V4.1
V4.2
9006
225
215
135
125
116
115
1125
1115
Tp(0011/0112)
Tp(01/4)
Tp(0112/4)
Tp(4/012)
Tp(02/022)
Tp(012/014)
Tp(022/2)
Tp(014/2)
Tp(2/B)
Tp(B/231231)

Image preprocessing steps:

We have original image (table with large whitespace around it):
Binarize image with some adaptive threshold operation:
Extract largest contour (highlighted with green color):
Now we can crop a table:

Binarize cropped image:

Detect horizontal and vertical borders with some morphological operations

Please refer to this tutorial which explains how it works: https://docs.opencv.org/3.4/dd/dd7/tutorial_morph_lines_detection.html

Subtract detected horizontal and vertical masks from the original image and apply thresholding once more (final result):

call pytesseract.image_to_string with psm = 3

3 = Fully automatic page segmentation, but no OSD. (Default)
...
6 = Assume a single uniform block of text.

You called with --psm 6 but you do not have a single uniform block of text but complex structured document with borders. So it is hard for algorithms to correctly detect text blobs and recognize characters in this case.

Please refer to the documentation for more information on psm options: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#options

Complete example:

import pytesseract
import cv2
import numpy as np


original_image = cv2.imread("1.png", cv2.IMREAD_GRAYSCALE)

binary_image = cv2.threshold(original_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Find table borders
contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
largest_contour = max(contours, key=cv2.contourArea)
vimage = cv2.cvtColor(original_image, cv2.COLOR_GRAY2BGR)
vimage = cv2.drawContours(vimage, [largest_contour], 0, (0, 255, 0), 1)

x, y, w, h = cv2.boundingRect(largest_contour)
cropped_image = original_image[y : y + h, x : x + w]

# resize to width=600
scale_factor = 600.0 / cropped_image.shape[1]
cropped_image = cv2.resize(cropped_image, (0, 0), fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_LANCZOS4)

mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

height, width = mask.shape[:2]
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 1))
horizontal_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 3))
horizontal_mask = cv2.erode(mask, horizontal_kernel)
horizontal_mask = cv2.dilate(horizontal_mask, horizontal_kernel2, iterations=2)

vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, height // 2))
vertical_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (3, height // 2))
vertical_mask = cv2.erode(mask, vertical_kernel)
vertical_mask = cv2.dilate(vertical_mask, vertical_kernel2, iterations=3)

hor_ver_mask = cv2.bitwise_or(horizontal_mask, vertical_mask)
cropped_image[np.nonzero(hor_ver_mask)] = 255

mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

text = pytesseract.image_to_string(mask, config="--psm 3").replace('\n\n', '\n')

print(text)