Tags: python, python-3.x, web-scraping, python-imaging-library, python-tesseract

Unable to extract a word out of an image


I've written a Python script that uses pytesseract to extract a word from an image. The image contains only the single word TOOLS, and that is what I'm after. Currently the script below gives me the wrong output, WIS. What can I do to get the correct text?

The image is the one fetched from the URL in the script below.

This is my script:

import requests, io, pytesseract
from PIL import Image

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize((100, 100), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()

This is what the modified image looks like when I run the above script:

[image: the modified image]

The output I'm getting:

WIS

Expected output:

TOOLS

Solution

  • The key is to match the image transformations to what Tesseract can handle. Your main problem is that the font is an unusual one, so its texture has to be removed and its weight adjusted before recognition. All you need is:

    import io

    import pytesseract
    import requests
    from PIL import Image, ImageEnhance, ImageFilter

    response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
    img = Image.open(io.BytesIO(response.content))

    # remove texture
    enhancer = ImageEnhance.Color(img)
    img = enhancer.enhance(0)   # decolorize
    img = img.point(lambda x: 0 if x < 250 else 255) # threshold: everything that is not almost white becomes black
    img = img.resize((300, 100), Image.LANCZOS) # resize to remove noise
    img = img.point(lambda x: 0 if x < 250 else 255) # get rid of the remains of the noise
    # adjust font weight
    img = img.filter(ImageFilter.MaxFilter(11)) # lighten the font: thins the black strokes
    imagetext = pytesseract.image_to_string(img)
    print(imagetext)
    

    And voilà, the word is recognized:

    TOOLS
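
To see why the threshold and the max filter help, here is a minimal pure-Python sketch of both steps on a tiny hand-made grid. It is only an illustration, not Pillow's implementation: the helper names `threshold` and `max_filter` and the sample pixel values are made up for this example, and the border handling (windows clipped at the edge) is a simplifying assumption that may differ from what `ImageFilter.MaxFilter` does.

```python
def threshold(grid, cutoff):
    """Map every pixel below `cutoff` to 0 (black) and the rest to 255 (white)."""
    return [[0 if px < cutoff else 255 for px in row] for row in grid]


def max_filter(grid, size):
    """Replace each pixel with the maximum of its size x size neighbourhood.

    Windows are clipped at the image border (a simplification).
    """
    r = size // 2
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                grid[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


# Hypothetical sample: a 5-pixel-wide vertical stroke with a textured
# (unevenly grey) fill on a white background, repeated over 5 rows.
W = 255
image = [[W, 40, 120, 200, 90, 10, W, W] for _ in range(5)]

binary = threshold(image, 250)   # texture disappears: the stroke is solid black
thinned = max_filter(binary, 3)  # white spreads into the stroke from both sides

print(binary[2])   # [255, 0, 0, 0, 0, 0, 255, 255]
print(thinned[2])  # [255, 255, 0, 0, 0, 255, 255, 255]
```

With a high cutoff such as 250, every grey shade of the texture falls below it and turns black, so the letter becomes a solid shape. The max filter then lets white eat one pixel into the stroke on each side, so the 5-pixel stroke shrinks to 3. The answer's `MaxFilter(11)` applies the same idea with a larger window on a larger image.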