image python-3.x request python-imaging-library ocr

Opening Image file from url with PIL for text recognition with pytesseract

I am facing a confusing problem when trying to download an image and open it with BytesIO in order to extract text from it using PIL and pytesseract.

>>> import requests
>>> import pytesseract
>>> from io import BytesIO
>>> from PIL import Image
>>> response = requests.get('http://abc/images/im.jpg')
>>> img = Image.open(BytesIO(response.content))
>>> img
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=217x16 at 0x7FDAD185CB38>
>>> text = pytesseract.image_to_string(img)
>>> text
''

Here it gives an empty string.

However, if I save the image and then open it again with PIL, pytesseract gives the right result.

>>> img.save('im1.jpg')
>>> im = Image.open('im1.jpg')
>>> pytesseract.image_to_string(im)
'The right text'

And just to confirm, both images have the same size.

>>> im.size
(217, 16)
>>> img.size
(217, 16)

What can be the problem? Is it necessary to save the image or am I doing something wrong?

Solution

You seem to have a problem that I can't reproduce. Diagnosing it, if there is a problem at all, would require many more details. Instead of asking for them, I will assume (based on my overall experience) that the problem would vanish and become unreproducible while you were gathering those details. In that sense, this answer is a solution to your problem.

In case it is not, let me know if you need further assistance. At the very least, given what you have experienced, you can be sure that your approach is generally right and that you did nothing apparently wrong.

Here is the FULL code (your question is missing hints about which modules are necessary), and the image is actually ONLINE so that anyone else can test whether the code works (you didn't provide an image that exists online in your question):

import io
import requests
import pytesseract
from PIL import Image

# Download the image; response.content holds the raw bytes.
response = requests.get("http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg")
# print(type(response))  # <class 'requests.models.Response'>

# Wrap the bytes in a file-like object so PIL can open them without a file on disk.
img = Image.open(io.BytesIO(response.content))
# print(type(img))  # <class 'PIL.JpegImagePlugin.JpegImageFile'>

# Run OCR directly on the in-memory image.
text = pytesseract.image_to_string(img)
print(text)

Here is the pytesseract output:

Hey! I just saw on CNN
there was an earthquake
near you. Are you ok?






‘ Yes! We‘re all line!

What did it rate on the titty
scale?
‘ Well they only jiggled a

little bit, so probably not

that high.
HAHAHAHAHAHA I LOVE
YOU
Richter scale. My phone is l
a 12 yr old boy.

My system: Linux Mint 18.1 with Python 3.6
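
If the empty string still shows up on your system, a sketch like the following might help narrow it down. It is only a diagnostic aid, not a confirmed fix, since I could not reproduce the problem. It repeats the save/reopen round trip from your question in memory instead of on disk and compares more than just the size. The helper names (reencode_in_memory, describe, pixels_equal) are made up for this illustration, and a synthetic white image stands in for the downloaded one so that the snippet runs without network access or Tesseract.

```python
import io

from PIL import Image


def reencode_in_memory(img, fmt="PNG"):
    """Save img into a BytesIO buffer and reopen it, without touching the disk."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    buf.seek(0)
    copy = Image.open(buf)
    copy.load()  # decode now, so the copy no longer depends on the buffer
    return copy


def describe(img):
    """Collect properties that can differ even when the size is identical."""
    return {
        "format": img.format,
        "mode": img.mode,
        "size": img.size,
        "dpi": img.info.get("dpi"),
    }


def pixels_equal(a, b):
    """Return True if both images have the same mode, size and raw pixel data."""
    return a.mode == b.mode and a.size == b.size and a.tobytes() == b.tobytes()


# Stand-in for the downloaded image (assumption: same size as in the question).
original = Image.new("RGB", (217, 16), "white")
roundtrip = reencode_in_memory(original, "PNG")

print(describe(original))
print(describe(roundtrip))
print(pixels_equal(original, roundtrip))
```

In the session from the question, one could call reencode_in_memory(img, "JPEG") on the downloaded image and pass the result to pytesseract.image_to_string. If that copy yields text while img does not, comparing describe(img) with describe(copy) may show which property differs between the two.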