python opencv ocr tesseract bounding-box

Delete an OCR word from an image (OpenCV, Python)


I am working with OCR. The script works well enough for what I need: it detects the words with an accuracy that is acceptable for me.

This is the result: 100% accuracy on the attached image.

[image: syd]

from PIL import Image
import pyocr.builders
import os

os.putenv("TESSDATA_PREFIX", "C:\\Program Files (x86)\\Tesseract-OCR")

tools = pyocr.get_available_tools()
tool = tools[0]
langs = tool.get_available_languages()
lang = langs[0] #eng

file = "test.png"

txt = tool.image_to_string(Image.open(file), lang=lang, builder=pyocr.builders.TextBuilder())
print(txt + '\n')

'''
word = ['SHINE','ON','YOU','CRAZY','DIAMOND','SYD']

if word[2] in txt:
    print("## WORD IN LIST ##")
else:
    print("## NOT IN LIST ##")'''

Now the question: how can I remove from the image a word that exists in the OCR output (named txt in the code)? That is, if the word SHINE appears in the console output (and in the list), how can I delete it from the image? Or, if it cannot be removed, how can I create a mask to hide it?

I think the OCR works by selecting areas of text and creating a bounding box around each one. In that case, how can I delete (or even just show) this ROI/bounding box? The pyocr documentation has some hints about showing bounding boxes, but I don't know how to use that feature.

Any help/hint is appreciated.

Thanks

EDIT: this code shows me the bounding box for each character:

import csv
import cv2
from pytesseract import pytesseract as pt

pt.run_tesseract('test.png', 'output', lang=None, boxes=True, config="hocr")

# To read the coordinates
boxes = []
with open('output.box', 'rt') as f:
    reader = csv.reader(f, delimiter = ' ')
    for row in reader:
        if len(row) == 6:
            boxes.append(row)

# Draw the bounding box
img = cv2.imread('test.png')
h, w, _ = img.shape
for b in boxes:
    img = cv2.rectangle(img,(int(b[1]),h-int(b[2])),(int(b[3]),h-int(b[4])),(255,0,0),2)

cv2.imshow('output', img)
cv2.waitKey(0)


How can I tell it to show me only the first (whole) word?


Solution

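  • Here's a simple approach:

    • Convert the image to grayscale
    • Apply Otsu's threshold to obtain a binary image
    • Invert and dilate to connect the letters of each word into a single contour
    • Find contours and extract the ROI for each word
    • Perform OCR on each ROI and remove the word if it matches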

    After converting to grayscale, we apply Otsu's threshold to obtain a binary image.

    [image: binary image after Otsu's threshold]

    Next we invert the image and dilate it so that the letters of each word merge into a single contour.

    [image: inverted and dilated image]

    From here we find the contours and extract the ROI for each word. Here are the detected ROIs:

    [image: detected ROIs]

    We pass each ROI to Pytesseract for OCR. If the OCR result is a word we want to remove, we simply "delete" the word by filling its ROI with white in the original image.
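The comparison between the OCR result and the list of words is sensitive to case, stray whitespace, and punctuation. The following is a minimal sketch of one way to normalize the text before matching. The helper names `normalize_ocr_text` and `should_remove` are hypothetical and not part of pytesseract, and the sample strings are made up for illustration.

```python
import string


def normalize_ocr_text(raw: str) -> str:
    """Lowercase OCR output and trim whitespace and surrounding punctuation."""
    # OCR output may end with a newline or a form feed ("\x0c");
    # str.strip() with no arguments removes both.
    return raw.strip().strip(string.punctuation).strip().lower()


def should_remove(raw: str, words_to_remove) -> bool:
    """Return True if the normalized OCR text is one of the target words."""
    targets = {w.lower() for w in words_to_remove}
    return normalize_ocr_text(raw) in targets


words_to_remove = ['on', 'you', 'crazy']
print(should_remove("CRAZY\n\x0c", words_to_remove))  # matches 'crazy'
print(should_remove("Diamond\n", words_to_remove))    # not in the list
```

An exact match on the whole ROI text is used here, rather than a substring test, so that removing 'on' does not also match a word such as 'diamond'.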


    With

    words_to_remove = ['on', 'you', 'crazy']
    

    The result is

    [image: result]

    Similarly with

    words_to_remove = ['on', 'you', 'shine', 'diamond']
    

    The result is

    [image: result]

    Finally with

    words_to_remove = ['on', 'you', 'crazy', 'diamond']
    

    [image: result]

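    import cv2
    import pytesseract
    
    words_to_remove = ['on', 'you', 'crazy', 'diamond']
    # Path to the Tesseract executable (Windows); adjust for your installation
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    
    # Load the image, convert it to grayscale, and apply Otsu's threshold
    image = cv2.imread("1.png")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    
    # Invert the binary image, then dilate to merge the letters of each word
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
    inverted_thresh = 255 - thresh
    dilate = cv2.dilate(inverted_thresh, kernel, iterations=4)
    
    # Find one external contour per word; findContours returns 2 or 3 values
    # depending on the OpenCV version, so pick the contour list accordingly
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        ROI = thresh[y:y+h, x:x+w]
        # OCR the ROI; strip surrounding whitespace so the list lookup can match
        data = pytesseract.image_to_string(ROI, lang='eng',config='--psm 6').strip().lower()
        if data in words_to_remove:
            # "Delete" the word by filling its bounding box with white
            image[y:y+h, x:x+w] = [255,255,255]
    
    # Show the intermediate and final images
    cv2.imshow("thresh", thresh)
    cv2.imshow("dilate", dilate)
    cv2.imshow("image", image)
    cv2.waitKey(0)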
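As a possible alternative to finding contours (this is a different technique from the approach above, not part of it), pytesseract can report word-level bounding boxes directly through `pytesseract.image_to_data`. The sketch below only shows how such output could be filtered. The function name `word_boxes` is hypothetical, `data` is assumed to be the dict returned with `output_type=pytesseract.Output.DICT`, and the sample dict is hand-written for illustration rather than real OCR output.

```python
def word_boxes(data, words_to_remove, min_conf=0.0):
    """Return (x, y, w, h) for each recognized word that should be removed.

    `data` is assumed to have the parallel lists "text", "conf", "left",
    "top", "width" and "height", as in pytesseract's DICT output.
    """
    targets = {w.lower() for w in words_to_remove}
    boxes = []
    for i, text in enumerate(data["text"]):
        word = text.strip().lower()
        # Rows for pages, blocks and lines carry empty text and conf -1.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        if word in targets:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


# Hand-written sample in the assumed layout (values are illustrative only).
sample = {
    "text":   ["", "SHINE", "ON", "YOU", ""],
    "conf":   ["-1", "96", "95", "93", "-1"],
    "left":   [0, 10, 120, 170, 0],
    "top":    [0, 10, 10, 10, 0],
    "width":  [300, 100, 40, 55, 300],
    "height": [60, 20, 20, 20, 60],
}
print(word_boxes(sample, ["on", "you"]))
```

With a real image, `data` would come from `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, and each returned box could be filled with `image[y:y+h, x:x+w] = [255,255,255]` in the same way as in the code above.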