python, python-tesseract, text-extraction

Extract text from an online image URL in Python


I have written this code based on references found on the web and some YouTube videos, but it doesn't work for me and I can't figure out what the issue is.

import io
import requests
import pytesseract
from PIL import Image

r = requests.get("http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg",stream=True)
# print( type(response) ) # <class 'requests.models.Response'>

img = Image.open(io.BytesIO(r.content))
# print( type(img) ) # <class 'PIL.JpegImagePlugin.JpegImageFile'>
text = pytesseract.image_to_string(img)

print(text)

I am getting this error:

  File "F:\Projects\FileExtractor\untitled3.py", line 16, in <module>
    img = Image.open(io.BytesIO(r.content))

  File "C:\ProgramData\Anaconda3\lib\site-packages\PIL\Image.py", line 2943, in open
    raise UnidentifiedImageError(

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x000001E85C0BAA40>

Please help me with this issue. Thank you.


Solution

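  • Had a further thought: why not spoof the browser headers? Sending a browser-like User-Agent now appears to work, presumably because the server refuses the default python-requests User-Agent and returns something other than the image, which PIL then cannot identify.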

    import io
    import requests
    import pytesseract
    from PIL import Image
    
    url = 'http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    
    # Send the request with a browser-like User-Agent instead of the requests default
    r = requests.get(url, headers=headers)
    
    img = Image.open(io.BytesIO(r.content))
    # print(type(img))  # <class 'PIL.JpegImagePlugin.JpegImageFile'>
    text = pytesseract.image_to_string(img)
    
    print(text)
    

    The output is:

    Hey! | just saw on CNN
    
    there was an earthquake
    near you. Are you ok?
    
     
    
    | Yes! We're all fine!
    
    What did it rate.on the titty
    scale?
    
    | Well they only jiggled a |
    
     
    
    little bit, so probably not
    
    that high.
    
    HAHAHAHAHAHA | LOVE
    YOU
    
    Richter scale. My phone is |
    a 12 yr old boy.
    
    —————————r
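
If you hit UnidentifiedImageError again with a different URL, it can help to look at what the server actually sent before passing it to Image.open. The traceback in the question doesn't show that, so the following is only a rough sketch: diagnose_image_response is a hypothetical helper (not part of requests or Pillow), and the signature table covers just a few common formats.

```python
# Hypothetical helper (not part of requests or Pillow): decide whether an
# HTTP response plausibly carries image data before handing it to Image.open.

# Leading "magic" bytes of a few common image formats (deliberately incomplete).
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def diagnose_image_response(status_code, content_type, body):
    """Return (ok, message) describing whether the body looks like an image."""
    if status_code != 200:
        return False, f"server answered with HTTP status {status_code}"
    for signature, name in IMAGE_SIGNATURES.items():
        if body.startswith(signature):
            return True, f"body starts with a {name} signature"
    # No known signature: report what the server claims it sent and a short preview.
    preview = body[:60].decode("utf-8", errors="replace")
    return False, (
        f"not a known image (Content-Type: {content_type!r}, "
        f"starts with {preview!r})"
    )
```

With the code above you would call it as diagnose_image_response(r.status_code, r.headers.get("Content-Type", ""), r.content) and only go on to Image.open when the first element of the result is True. If the message shows an HTML page or a non-200 status, the server is probably blocking the request rather than the image being corrupt.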