Search code examples
opencvimage-processingocrtesseractpython-tesseract

Reading text from a scanned image


I am trying to read text from a scanned image using the pytesseract library. Here is enter image description here

And this is how I'm trying to read it.


import matplotlib.pyplot as plt
import pytesseract
import cv2

# load the original image
image = cv2.imread("Figure_1.png")
plt.figure(figsize=(10,10))
plt.imshow(image)
# convert the image to black and white for better OCR
ret,thresh1 = cv2.threshold(image,120,255,cv2.THRESH_BINARY)

# pytesseract image to string to get results
text = str(pytesseract.image_to_string(image, config='--psm 6'))
print (text)

The output of my script populates is complete garbage. I'm trying to read the column called "Objekt". What am I missing?

Objekt
|
pe
eo
ee
eee

Solution

  • You may need some image preprocessing to be done and call tesseract with correct psm option. Results after image preprocessing and psm=3:

    Objekt
    V1.1
    V3.1
    V5.1
    V6.1
    Sp.2
    $.l6s1
    9005
    226
    216
    316
    136
    126
    9015
    1116
    Tp(231221/A)
    Tp(231301/0)
    Tp(A/1)
    Tp(1/3)
    Tp(1/5)
    Tp(3/011)
    Tp(3/6)
    To(D/6)
    Tp(5/021)
    Tp(5)
    Tp(011/01)
    Tp(6/0011)
    Tp(021/02)
    V2.1
    V2.2
    V2.3
    V4.1
    V4.2
    9006
    225
    215
    135
    125
    116
    115
    1125
    1115
    Tp(0011/0112)
    Tp(01/4)
    Tp(0112/4)
    Tp(4/012)
    Tp(02/022)
    Tp(012/014)
    Tp(022/2)
    Tp(014/2)
    Tp(2/B)
    Tp(B/231231)
    

    Image preprocessing steps:

    1. We have original image (table with large whitespace around it): enter image description here

    2. Binarize image with some adaptive threshold operation: enter image description here

    3. Extract largest contour (highlighted with green color): enter image description here

    4. Now we can crop a table:

    enter image description here

    1. Binarize cropped image:

    enter image description here

    1. Detect horizontal and vertical borders with some morphological operations

    Please refer to this tutorial which explains how it works: https://docs.opencv.org/3.4/dd/dd7/tutorial_morph_lines_detection.html

    enter image description here enter image description here

    1. Subtract detected horizontal and vertical masks from the original image and apply thresholding once more (final result):

    enter image description here

    1. call pytesseract.image_to_string with psm = 3
    3 = Fully automatic page segmentation, but no OSD. (Default)
    ...
    6 = Assume a single uniform block of text.
    

    You called with --psm 6 but you do not have a single uniform block of text but complex structured document with borders. So it is hard for algorithms to correctly detect text blobs and recognize characters in this case.

    Please refer to the documentation for more information on psm options: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#options

    Complete example:

    import pytesseract
    import cv2
    import numpy as np
    
    
    original_image = cv2.imread("1.png", cv2.IMREAD_GRAYSCALE)
    
    binary_image = cv2.threshold(original_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    
    # Find table borders
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest_contour = max(contours, key=cv2.contourArea)
    vimage = cv2.cvtColor(original_image, cv2.COLOR_GRAY2BGR)
    vimage = cv2.drawContours(vimage, [largest_contour], 0, (0, 255, 0), 1)
    
    x, y, w, h = cv2.boundingRect(largest_contour)
    cropped_image = original_image[y : y + h, x : x + w]
    
    # resize to width=600
    scale_factor = 600.0 / cropped_image.shape[1]
    cropped_image = cv2.resize(cropped_image, (0, 0), fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_LANCZOS4)
    
    mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    
    height, width = mask.shape[:2]
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 1))
    horizontal_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 2, 3))
    horizontal_mask = cv2.erode(mask, horizontal_kernel)
    horizontal_mask = cv2.dilate(horizontal_mask, horizontal_kernel2, iterations=2)
    
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, height // 2))
    vertical_kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (3, height // 2))
    vertical_mask = cv2.erode(mask, vertical_kernel)
    vertical_mask = cv2.dilate(vertical_mask, vertical_kernel2, iterations=3)
    
    hor_ver_mask = cv2.bitwise_or(horizontal_mask, vertical_mask)
    cropped_image[np.nonzero(hor_ver_mask)] = 255
    
    mask = cv2.threshold(cropped_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    
    text = pytesseract.image_to_string(mask, config="--psm 3").replace('\n\n', '\n')
    
    print(text)