python opencv machine-learning computer-vision python-tesseract

OpenCV OCR: improve data extraction from a color image with background


I am trying to extract some information from mobile screenshots. My code retrieves some of the text, but not all of it. I read the image, convert it to greyscale, remove the parts I don't need, and apply adaptive Gaussian thresholding. However, not all of the text is read.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\\Installs\\Tools\\Tesseract-OCR\\tesseract.exe'

image = "C:\\Workspace\\OCR\\tesseract\\rpstocks1 - Copy (2).png"
img = cv2.imread(image)
img_grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)

height, width, channels = img.shape
print (height, width, channels)


rec_img=cv2.rectangle(img_grey,(30,100),(1040,704),(0,255,0),3).copy()

crop_img = rec_img[105:1945, 35:1035].copy()
cv2.medianBlur(img,5)
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_GREY.jpg",crop_img)

img_gauss = cv2.adaptiveThreshold(crop_img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,cv2.THRESH_BINARY,11,12)
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_Guass.jpg",img_gauss)
text = pytesseract.image_to_string(img_gauss, lang='eng')
text.encode('utf-8')
print(text)

Output

Image Dimensions 704 1080 3

Investing

$9,712.99 
ASRT _ 0
500.46 shares  ......... ..  /0 
GNUS 
25169 Shares  """"" " ‘27.98%



Solution

  • Have a look at the page segmentation modes of Tesseract, cf. this Q&A. For example, config='-psm 12' (sparse text with orientation and script detection) already yields all the desired texts; note that newer Tesseract releases spell the flag --psm. However, the graphs are then also interpreted as text.

    That's why I would preprocess the image to get single boxes (the actual texts, the graphs, the information at the top, etc.), and then keep only those boxes with content of interest. The filtering can be done using:

    • the y coordinate of the bounding rectangle (not in the upper 5 % of the image, that's the mobile phone status bar),
    • the width w of the bounding rectangle (not wider than 50 % of the image's width, these are the horizontal lines),
    • the x coordinate of the bounding rectangle (not in the middle third of the image, these are the graphs).

    What's left is to run pytesseract on each cropped image, for example with config='-psm 6' (assume a single uniform block of text), and to strip any line breaks from the resulting texts.
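
    As a side note on the config strings used above: the segmentation mode is just a number handed to Tesseract through a flag. The following is a minimal sketch, not part of the original answer. The helper name psm_config and its legacy_flag parameter are made up for illustration, and the valid range 0 to 13 is assumed from Tesseract 4.x. The mode descriptions are a subset of what tesseract --help-psm prints.

```python
# Subset of the page segmentation modes listed by `tesseract --help-psm`.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    11: "Sparse text. Find as much text as possible in no particular order",
    12: "Sparse text with OSD",
}


def psm_config(mode, legacy_flag=False):
    """Build a config string for pytesseract.image_to_string (hypothetical helper).

    legacy_flag=True produces the older single-dash spelling used in this
    answer; the default produces the double-dash spelling.
    """
    # Assumption: modes 0..13 exist, as in Tesseract 4.x.
    if not 0 <= mode <= 13:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    flag = "-psm" if legacy_flag else "--psm"
    return f"{flag} {mode}"


print(psm_config(12))                    # --psm 12
print(psm_config(6, legacy_flag=True))   # -psm 6
print(PSM_MODES[6])                      # Assume a single uniform block of text
```

    The returned string can be passed as the config argument of pytesseract.image_to_string.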

    Here is my complete code:

    import cv2
    import pytesseract
    
    # Read image
    img = cv2.imread('cUcby.png')
    hi, wi = img.shape[:2]
    
    # Convert to grayscale for Tesseract
    img_grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    # Mask single boxes by thresholding and morphological closing in x direction
    mask = cv2.threshold(img_grey, 248, 255, cv2.THRESH_BINARY_INV)[1]
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (51, 1)))
    
    # Find contours; the return tuple has 2 items in OpenCV 2.x/4.x and 3 in OpenCV 3.x
    cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    
    # Get bounding rectangles
    rects = [cv2.boundingRect(cnt) for cnt in cnts]
    
    # Filter bounding rectangles:
    # - not in the upper 5 % of the image (mobile phone status bar)
    # - not wider than 50 % of the image's width (horizontal lines)
    # - not being in the middle third of the image (graphs)
    rects = [(x, y, w, h) for x, y, w, h in rects if
             (y > 0.05 * hi) and
             (w <= 0.5 * wi) and
             ((x < 0.3333 * wi) or (x > 0.6666 * wi))]
    
    # Sort bounding rectangles first by y coordinate, then by x coordinate
    rects = sorted(rects, key=lambda x: (x[1], x[0]))
    
    # Get texts from bounding rectangles from pytesseract
    # (pad each crop by 1 px; clamp at 0 so a box at the border does not wrap around)
    texts = [pytesseract.image_to_string(
        img_grey[max(y-1, 0):y+h+1, max(x-1, 0):x+w+1], config='-psm 6')
        for x, y, w, h in rects]
    
    # Remove line breaks
    texts = [text.replace('\n', '') for text in texts]
    
    # Output
    print(texts)
    

    And this is the output:

    ['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares', 'GNUS', '-27.98%', '251.69 shares']
    

    Since you have the locations of the bounding rectangles, you could also re-arrange the whole text using that information.
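
    One way to do that, sketched under assumptions: group boxes whose vertical centres lie close together into one row, then join each row's texts from left to right. The function name group_into_rows, the y_tolerance value, and the example coordinates below are hypothetical and not taken from the actual screenshot.

```python
def group_into_rows(rects, texts, y_tolerance=10):
    """Join texts of boxes that share a visual row, ordered left to right.

    rects are (x, y, w, h) tuples as returned by cv2.boundingRect.
    A box joins the current row when its vertical centre is within
    y_tolerance pixels of the centre of the box that started the row.
    """
    # Sort all boxes top to bottom by vertical centre.
    items = sorted(zip(rects, texts), key=lambda it: it[0][1] + it[0][3] / 2)

    rows = []
    for (x, y, w, h), text in items:
        centre_y = y + h / 2
        if rows and abs(centre_y - rows[-1]["centre_y"]) <= y_tolerance:
            rows[-1]["items"].append((x, text))
        else:
            rows.append({"centre_y": centre_y, "items": [(x, text)]})

    # Within each row, order by x coordinate and join with spaces.
    return [" ".join(text for _, text in sorted(row["items"])) for row in rows]


# Made-up coordinates for illustration only.
rects = [(40, 120, 200, 50), (40, 200, 260, 70), (40, 400, 100, 40),
         (850, 402, 150, 36), (40, 450, 220, 30)]
texts = ['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares']

print(group_into_rows(rects, texts))
# ['Investing', '$9,712.99', 'ASRT -27.64%', '500.46 shares']
```

    The tolerance would need tuning to the font sizes in the screenshot; a value that is too large merges adjacent lines, while one that is too small splits a row.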

    ----------------------------------------
    System information
    ----------------------------------------
    Platform:      Windows-10-10.0.16299-SP0
    Python:        3.9.1
    PyCharm:       2021.1.1
    OpenCV:        4.5.1
    pytesseract:   4.00.00alpha
    ----------------------------------------