
How to remove unwanted text extracted from the image?


I am working on a project called Business Card Scanner. I am extracting text from the image using pytesseract and then classifying the obtained text using regex and other techniques.

Whenever there is a logo in an image, Tesseract treats it as text and tries to read it, which produces meaningless output. Consider the example image below:

IMG

Here is what I have tried to extract the text:

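# Google Colab
# import the required libraries
import cv2
import pytesseract
from google.colab.patches import cv2_imshow

img = cv2.imread("img2.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
# with THRESH_OTSU the threshold is chosen automatically, so the 0 is ignored
ret3, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2_imshow(thresh)
text = pytesseract.image_to_string(thresh, lang="eng")
print(text)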

This is what I get as output when I run the above code:

: eM , NOEs Efe: Mb fes fe y Ky TEP ON PILLS cag
gy: Ye Ws My Wii WL, FLY T by,

i igs Mg ER te EB iy MY, Gee.
: WO Ee as _ he i. "4 ‘; y sen “iy ye age i ‘ el HY tiber My, ee ered fi! ", ty Mf

Mm Gujarat TE og
: , fp bet
(x = Technological ( Wy, ey,

sae ae e . Tf) :
wage University ~~ es

e e é et

ikhil Suthar lees
fy Lg. Z - “fe " ‘Sa
. ve 7, of

Regional Coordinator - OSD MWe) Dh
ye

Mob. <hidden>

Email : <hidden>

Govt. Technical High School Campus, Near Aurobindo
Ashram Dandia Bazar,Vadodara - 390001, Gujarat, India
www.gtu.ac.in | www-gtuinnovationcouncil.2¢.in

i Ae

; ew, OD
t eS ft me ' @
ate
ary ya
j my

ee |
a

Is there a way to remove this unwanted text, which I think is produced by the logo? Please let me know if my question requires any other information.


Solution

  • The background of the image is the problem. You can exclude it by cropping the image to a height range before running OCR.

    For example, if you keep only the rows between h/4 and (3*h)/4, the result will be (the image is resized because it exceeded 2 MiB):

    IMG

    When you read the cropped image, the output is:

    Nikhil Suthar
    Regional Coordinator - OSD
    
    Email | Mob. |
    
    Govt. Technical High School Campus, Near Aurobindo
    Ashram Dandia Bazar, Vadodara - 390001, Gujarat, India
    www.gtu.ac.in | www.gtuinnovationcouncil.ac.in
    

    Code:


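    import cv2
    from pytesseract import image_to_string
    
    # load the card and convert it to grayscale
    img = cv2.imread("Oa9svHu.jpeg")
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    (h, w) = gry.shape[:2]
    # keep only the middle half of the rows, dropping the top and bottom quarters
    gry = gry[int(h/4):int((3*h)/4), 0:w]
    txt = image_to_string(gry)
    print(txt.strip())
    # show the cropped region (in Google Colab, use cv2_imshow instead)
    cv2.imshow("gry", gry)
    cv2.waitKey(0)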
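
The fixed h/4 and (3*h)/4 bounds suit this particular card, and other cards will likely need a different band. As a minimal sketch, the crop can be wrapped in a small helper so the fractions are easy to adjust. The names `height_band` and `crop_band` and their parameters are hypothetical and not part of OpenCV or pytesseract; the helper only assumes a NumPy-style image array with rows first, such as the one returned by `cv2.imread`.

```python
def height_band(height, top_frac=0.25, bottom_frac=0.75):
    """Return (start, stop) row indices for a horizontal band of an image.

    Hypothetical helper: the fractions are measured from the top edge,
    so the defaults reproduce the h/4 .. (3*h)/4 range used above.
    """
    if not 0.0 <= top_frac < bottom_frac <= 1.0:
        raise ValueError("expected 0 <= top_frac < bottom_frac <= 1")
    return int(height * top_frac), int(height * bottom_frac)


def crop_band(image, top_frac=0.25, bottom_frac=0.75):
    """Crop a NumPy-style image (rows first) to the given height band."""
    start, stop = height_band(image.shape[0], top_frac, bottom_frac)
    # slicing only the first axis keeps the full width and any channels
    return image[start:stop]
```

With the code above, `gry = crop_band(gry)` would replace the manual slice, and a call such as `crop_band(gry, 0.3, 0.9)` would try a different band. The right fractions still have to be found by inspecting each card layout.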