Tags: python, opencv, ocr, opencv3.0, python-tesseract

How to extract the highlighted text from an image


I have code that highlights the user's name in an image. I want to extract the text, i.e. the user's name, from that image. The code is below:

import matplotlib.pyplot as plt
import cv2
import easyocr
from pylab import rcParams
from IPython.display import Image

rcParams['figure.figsize'] = 8, 16

# Run OCR; each result is (bounding box, text, confidence)
reader = easyocr.Reader(['en'])
output = reader.readtext('MP-SAMPLE1.jpg')

# Bounding box (four [x, y] corners) of the detection that holds the name;
# the index -106 is specific to this image
cord = output[-106][0]
x_min, y_min = [int(min(idx)) for idx in zip(*cord)]
x_max, y_max = [int(max(idx)) for idx in zip(*cord)]

# Draw a red rectangle around the name and display the image
image = cv2.imread('MP-SAMPLE1.jpg')
cv2.rectangle(image,(x_min,y_min),(x_max,y_max),(0,0,255),2)
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

I set the coordinates for my image; you can adjust them for yours. I need to extract the text inside the rectangular box. I am new to this field, so please excuse any mistakes I have made.



Solution

  • Here is my partial solution to the problem.

    Since you are a beginner, here is some advice: always start with pre-processing.

    Pre-processing helps remove unwanted artifacts.

    For instance, you can apply thresholding: Thresholding-result

    or median filtering: Median-filter result

    I used thresholding; after that you can use the pytesseract library, which offers many configuration options.
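To make the two pre-processing operations concrete, here is a small dependency-free sketch on a tiny grayscale grid. It is only an illustration: the function names `threshold` and `median_filter3` and the sample grid are made up for this example, the threshold is fixed rather than chosen by Otsu's method, and in practice you would call `cv2.threshold` and `cv2.medianBlur` on the real image instead.

```python
from statistics import median


def threshold(gray, thresh=127, maxval=255):
    """Binary threshold: pixels above `thresh` become `maxval`, the rest 0."""
    return [[maxval if px > thresh else 0 for px in row] for row in gray]


def median_filter3(gray):
    """3x3 median filter; pixels outside the image repeat the nearest edge pixel."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the 3x3 neighbourhood, clamping indices at the borders.
            window = [
                gray[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# Hypothetical 3x4 patch: dark background with one bright speck of noise.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]

cleaned = median_filter3(noisy)                  # the isolated speck is removed
binary = threshold([[10, 200], [127, 128]])      # only values above 127 turn white
print(cleaned)
print(binary)
```

The median filter is a reasonable choice when the artifacts are isolated specks, while thresholding helps when the text and the background differ clearly in brightness.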

    Also, for non-English languages, you can follow this tutorial.

    So, you want the text next to FATHERS HUSBANDS. Therefore we could:

      1. Convert the image to text.

        • text = pytesseract.image_to_string(Image.open(f_name), lang='eng')
          
      2. In the text, find the line that contains FATHERS HUSBANDS

        • for line in text.split('\n'):
              if "FATHERS HUSBANDS" in line:
                  name = line.split('.')[1].split(',')[0]
                  print(name)
          
        • Result:

          • GRAMONAN GROVER
            

    The last name is correct, but the first name is only partially correct: it should be BRAJMONAN.
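The parsing step above assumes that the label line always contains a '.' and a ','. A slightly more defensive sketch is shown below; the helper name `extract_name` and the sample OCR text are hypothetical, since the exact line layout depends on what Tesseract returns for your image. Unlike the snippet above, it looks for the '.' only after the label and returns `None` instead of raising an error when nothing matches.

```python
def extract_name(text, label="FATHERS HUSBANDS"):
    """Return the name that follows `label` on its line, or None if not found."""
    for line in text.splitlines():
        _, found, rest = line.partition(label)
        if not found:
            continue  # label is not on this line
        # Keep what follows the first '.', or the whole remainder if there is none.
        before_dot, dot, after_dot = rest.partition(".")
        candidate = after_dot if dot else before_dot
        # Cut at the first ',' and trim surrounding whitespace.
        name = candidate.split(",")[0].strip()
        if name:
            return name
    return None


# Hypothetical OCR output, made up to mimic the layout the parsing code expects.
sample_text = "SOME HEADER\nFATHERS HUSBANDS NAME. GRAMONAN GROVER, 12345\n"
name = extract_name(sample_text)
print(name)
```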

    I wrote this answer hoping to guide you toward your solution. Good luck.

    Code:


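    import os
    import cv2
    import pytesseract

    from PIL import Image

    # Load the image and convert it to grayscale
    img = cv2.imread("FXSCh.jpg")
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Binarize with Otsu's automatically chosen threshold
    gry = cv2.threshold(gry, 0, 255,
                        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

    # Write the thresholded image to a temporary file for Tesseract
    f_name = "{}.png".format(os.getpid())
    cv2.imwrite(f_name, gry)

    text = pytesseract.image_to_string(Image.open(f_name), lang='eng')

    # Find the label line and keep the text after the first '.',
    # up to the following ',' (or '.')
    for line in text.split('\n'):
        if "FATHERS HUSBANDS" in line:
            name = line.split('.')[1].split(',')[0]
            print(name)

    # Clean up the temporary file
    os.remove(f_name)

    # Show the original and the thresholded image
    cv2.imshow("Image", img)
    cv2.imshow("Output", gry)
    cv2.waitKey(0)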
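Since the question already runs easyocr, another option might be to read the text straight from its result instead of running a second OCR pass. With the default settings, `reader.readtext` returns a list of `(bounding box, text, confidence)` entries, so the text for the highlighted box sits next to the coordinates being drawn. The sketch below works on a made-up result list; the helper names `bbox_to_rect` and `text_at` and the sample values are hypothetical.

```python
def bbox_to_rect(bbox):
    """Convert four [x, y] corner points into (x_min, y_min, x_max, y_max)."""
    xs, ys = zip(*bbox)
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))


def text_at(results, index):
    """Return (text, confidence, rectangle) for one easyocr-style detection."""
    bbox, text, confidence = results[index]
    return text, confidence, bbox_to_rect(bbox)


# Hypothetical stand-in for `output = reader.readtext('MP-SAMPLE1.jpg')`.
sample_output = [
    ([[10, 5], [120, 5], [120, 30], [10, 30]], "NAME", 0.98),
    ([[130, 5], [300, 6], [300, 31], [130, 30]], "JOHN DOE", 0.91),
]

# In the question the index was -106; here the last entry is used.
text, confidence, rect = text_at(sample_output, -1)
print(text, confidence, rect)
```

With the real result, `output[-106][1]` would be the recognized string for the box that the question highlights. If that reading is poor, cropping the rectangle with `image[y_min:y_max, x_min:x_max]` and running OCR on the crop alone is another thing to try.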