python selenium web-scraping python-requests urllib

Issue when downloading an image from a URL with Python


I am trying to download an image from a URL with Python using the requests and shutil libraries. My code is below:

import requests
import shutil

image_url = "https://www.metmuseum.org/-/media/images/visit/met-fifth-avenue/fifthave_teaser.jpg"

with open("image1.jpg", "wb") as file:
    response = requests.get(image_url, stream=True)
    response.raw.decode_content = True
    shutil.copyfileobj(response.raw, file)
file.close()

This code works for most other image URLs that I have tried (e.g. https://tinyjpg.com/images/social/website.jpg). However, for the image_url in the code, a 1 KB file is created, and opening it gives an error that says "It looks like we don't support this file format."

I have also tried:

import urllib.request
urllib.request.urlretrieve(image_url, "image1.jpg")

It is possible to do this using Selenium Wire: I used driver.requests to get a list of all requests made by the site, and then looped through them until I found one whose request.response.headers indicated the file type (.jpg). It appears that there are two requests with the same URL (the first with content-type 'text/html' and the second with 'image/jpg').

I would like to run this without loading a WebDriver. Is there any way I can download an image like this using the requests library?


Solution

  • If you inspect response.text, you'll see that the server doesn't like your request headers and treats you as a robot:

    '<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<body>\r\n</body></html>\r\n'
    

    If you send a browser-like User-Agent header, the response changes to the actual image and you can save the file:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}

    # response.content reads the whole body, so stream=True is not needed here
    response = requests.get(image_url, headers=headers)

    with open("image1.jpg", "wb") as file:
        file.write(response.content)
    

    So you need to send a browser-like User-Agent in the request headers to get this image.

    Also, with is a context manager and closes the file for you, so the explicit file.close() after the block is not needed.
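
To avoid silently saving an HTML block page as image1.jpg again, you could check the response before writing it. This is only a minimal sketch: the helper name save_if_image is hypothetical (it is not part of requests), and it assumes the blocked response comes back with a non-image Content-Type such as text/html, as the question observed.

```python
from pathlib import Path


def save_if_image(response, path):
    """Write response.content to path only if the response looks like an image.

    `response` is expected to behave like a requests.Response, i.e. to
    provide .status_code, .headers and .content. Hypothetical helper for
    illustration only.
    """
    if response.status_code != 200:
        raise ValueError(f"unexpected status code {response.status_code}")

    # A bot-protection page is typically served as text/html, not image/*.
    content_type = response.headers.get("Content-Type", "")
    if not content_type.lower().startswith("image/"):
        raise ValueError(f"expected an image, got Content-Type {content_type!r}")

    body = response.content
    Path(path).write_bytes(body)
    return len(body)  # number of bytes written
```

With the headers dict shown earlier, it could be called as save_if_image(requests.get(image_url, headers=headers), "image1.jpg"). If the server answers with HTML, it raises ValueError instead of creating the 1 KB file.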