I am trying to read text from a scanned image using the pytesseract library. Here is
And this is how I'm trying to read it.
import matplotlib.pyplot as plt
import pytesseract
import cv2
# load the original image
image = cv2.imread("Figure_1.png")
plt.figure(figsize=(10,10))
plt.imshow(image)
# convert the image to black and white for better OCR
ret,thresh1 = cv2.threshold(image,120,255,cv2.THRESH_BINARY)
# pytesseract image to string to get results
text = str(pytesseract.image_to_string(image, config='--psm 6'))
print (text)
The output of my script populates is complete garbage. I'm trying to read the column called "Objekt". What am I missing?
Objekt
|
pe
eo
ee
eee
You may need some image preprocessing to be done and call tesseract with correct psm
option. Results after image preprocessing and psm=3:
Objekt
V1.1
V3.1
V5.1
V6.1
Sp.2
$.l6s1
9005
226
216
316
136
126
9015
1116
Tp(231221/A)
Tp(231301/0)
Tp(A/1)
Tp(1/3)
Tp(1/5)
Tp(3/011)
Tp(3/6)
To(D/6)
Tp(5/021)
Tp(5)
Tp(011/01)
Tp(6/0011)
Tp(021/02)
V2.1
V2.2
V2.3
V4.1
V4.2
9006
225
215
135
125
116
115
1125
1115
Tp(0011/0112)
Tp(01/4)
Tp(0112/4)
Tp(4/012)
Tp(02/022)
Tp(012/014)
Tp(022/2)
Tp(014/2)
Tp(2/B)
Tp(B/231231)
Image preprocessing steps:
Please refer to this tutorial which explains how it works: https://docs.opencv.org/3.4/dd/dd7/tutorial_morph_lines_detection.html
pytesseract.image_to_string
with psm
= 33 = Fully automatic page segmentation, but no OSD. (Default)
...
6 = Assume a single uniform block of text.
You called with --psm 6
but you do not have a single uniform block of text but complex structured document with borders. So it is hard for algorithms to correctly detect text blobs and recognize characters in this case.
Please refer to the documentation for more information on psm options: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#options
Complete example:
import pytesseract
import cv2
import numpy as np
original_image = cv2.imread("1.png", cv2.IMREAD_GRAYSCALE)
binary_image = cv2.threshold(original_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Find table borders
contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
largest_contour = max(contours, key=cv2.contourArea)
vimage = cv2.cvtColor(original_image, cv2.COLOR_GRAY2BGR)
vimage = cv2.drawContours(vimage, [largest_contour], 0, (0, 255, 0), 1)
x, y, w, h = cv2.boundingRect(largest_contour)
cropped_image = original_image[y : y + h, x : x + w]
# resize to width=600
scale_factor = 600.0 / cropped_image.shape[1]
cropped_image = cv2.resize(cropped_image, (0, 0), fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_LANCZOS4)
mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
height, width = mask.shape[:2]
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 1))
horizontal_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 3))
horizontal_mask = cv2.erode(mask, horizontal_kernel)
horizontal_mask = cv2.dilate(horizontal_mask, horizontal_kernel2, iterations=2)
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, height // 2))
vertical_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (3, height // 2))
vertical_mask = cv2.erode(mask, vertical_kernel)
vertical_mask = cv2.dilate(vertical_mask, vertical_kernel2, iterations=3)
hor_ver_mask = cv2.bitwise_or(horizontal_mask, vertical_mask)
cropped_image[np.nonzero(hor_ver_mask)] = 255
mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text = pytesseract.image_to_string(mask, config="--psm 3").replace('\n\n', '\n')
print(text)