Tags: python, python-requests, python-imaging-library

Downloading certain images from Wikipedia results in unexpected UnidentifiedImageError


Until recently I was downloading certain Wikipedia images with PIL and the requests library without any issues. Something has apparently changed, and I now get an error when trying to download and open the following images:

from PIL import Image
import requests

url_1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/" \
    + "Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg/2728px-Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg"

#url_2 = "https://upload.wikimedia.org/wikipedia/commons/9/9d/The_Scream_by_Edvard_Munch%2C_1893_-_Nasjonalgalleriet.png"

#url_3 = "https://upload.wikimedia.org/wikipedia/en/8/8f/Pablo_Picasso%2C_1909-10%2C_Figure_dans_un_Fauteuil_%28Seated_Nude%" \
#    + "2C_Femme_nue_assise%29%2C_oil_on_canvas%2C_92.1_x_73_cm%2C_Tate_Modern%2C_London.jpg"


response = requests.get(url_1, stream=True)
img = Image.open(response.raw)

And the resulting error message:

---------------------------------------------------------------------------

UnidentifiedImageError                    Traceback (most recent call last)

<ipython-input-2-9f0ecb1762aa> in <module>()
     13 
     14 response = requests.get(url_1, stream=True)
---> 15 img = Image.open(response.raw)

/usr/local/lib/python3.7/dist-packages/PIL/Image.py in open(fp, mode)
   2894         warnings.warn(message)
   2895     raise UnidentifiedImageError(
-> 2896         "cannot identify image file %r" % (filename if filename else fp)
   2897     )
   2898 

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f9b71d22bf0>

The error itself isn't very descriptive and I haven't been able to figure out how to fix it. Any help would be greatly appreciated. The URLs themselves lead to a perfectly normal image, and the code has been working up until this point.


Solution

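  • The problem is that Wikipedia expects a descriptive User-Agent header in your request. By default, requests identifies itself only with a generic python-requests/<version> string, which the server evidently rejects. If you supply your own User-Agent header, you get the image back as expected.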

    You can confirm this by looking at the text of the response instead of passing it straight to PIL. I copied your request and inspected response.text, which says "Error: 403, Forbidden. Please comply with the User-Agent policy". PIL was being handed that error message rather than image data, which is why it could not identify the file.

    For the user agent, you should supply something more descriptive than the placeholder in my example, such as the name of your script, or even just the word "script".

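    from PIL import Image
    import requests

    # Identify the client; replace this placeholder with something descriptive.
    headers = {
        'User-Agent': 'My User Agent 1.0'
    }
    picture_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg/2728px-Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg"
    r = requests.get(picture_url, headers=headers, stream=True)
    r.raise_for_status()  # surface a 403 as an HTTP error instead of a PIL error
    img = Image.open(r.raw)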
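
If you want the server's own message to show up whenever this happens again, something like the following might help. It is only a sketch: `build_user_agent`, `check_image_response` and `fetch_image` are hypothetical helper names, the `tool/version (contact)` layout is my assumption about what counts as "descriptive" (check the User-Agent policy for the exact wording), and `me@example.org` is a placeholder address.

```python
import io


def build_user_agent(tool, version, contact):
    """Return a descriptive User-Agent string, e.g. 'my-script/0.1 (me@example.org)'."""
    return f"{tool}/{version} ({contact})"


def check_image_response(status_code, content_type, body):
    """Raise RuntimeError carrying the server's message if the body is not an image."""
    if status_code != 200 or not content_type.startswith("image/"):
        # Error pages are short text, so a preview is usually enough to see the cause.
        preview = body[:200].decode("utf-8", errors="replace")
        label = content_type or "no content type"
        raise RuntimeError(f"HTTP {status_code} ({label}): {preview}")


def fetch_image(url, user_agent):
    """Download url with the given User-Agent and return it as a PIL image."""
    # Third-party imports stay local so the two helpers above work without them.
    import requests
    from PIL import Image

    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    check_image_response(
        response.status_code,
        response.headers.get("Content-Type", ""),
        response.content,
    )
    # Buffer the bytes so PIL receives a seekable file object.
    return Image.open(io.BytesIO(response.content))
```

Calling `fetch_image(url_1, build_user_agent("art-downloader", "0.1", "me@example.org"))` should then either return the image or fail with the 403 text quoted above, rather than with UnidentifiedImageError.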