OK, I'm trying to scrape a JPG image from the Gucci website. Take this one as an example.
I tried urllib.urlretrieve, which doesn't work because Gucci blocks it. So I wanted to use requests to fetch the raw data for the image and then write it into a .jpg file.
image = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg").text.encode('utf-8')
I encoded it because if I don't, it keeps telling me that gbk cannot encode the string.
Then:
with open('1.jpg', 'wb') as f:
    f.write(image)
Looks good, right? But the resulting jpg file cannot be opened. There's no image! Windows tells me the file is damaged.
What could be the problem?
I'm thinking that maybe when I scraped the image, I lost some information, or some characters are wrongly scraped. But how can I find out which?
I'm thinking that maybe some information is lost via encoding. But if I don't encode, I cannot even print it, not to mention writing it into a file.
What could go wrong?
I am not sure about the purpose of your use of encode. You're not working with text, you're working with an image. You need to access the response as binary data, not as text, and use image manipulation functions rather than text ones. Try this:
from PIL import Image
from io import BytesIO
import requests
response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
buffer = BytesIO(response.content)  # wrap the raw bytes in a file-like object (avoids shadowing the built-in bytes)
image = Image.open(buffer)
image.save("1.jpg")
Note the use of response.content instead of response.text. You will need to have PIL or Pillow installed to use the Image module. BytesIO is included in Python 3.
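To see why going through text damages the file, here is a small sketch that needs no network access. It assumes the bytes get decoded as UTF-8 with undecodable bytes replaced; that is only an assumption for illustration, since the encoding requests actually picks for an image response may differ. The four header bytes are a typical JPEG start, not data taken from the Gucci file.

```python
# Typical first bytes of a JPEG file (start-of-image marker, then an APP0 marker).
jpeg_header = b"\xff\xd8\xff\xe0"

# Assumption: decode as UTF-8 and replace anything that is not valid text.
# Bytes that cannot be decoded become the replacement character U+FFFD.
as_text = jpeg_header.decode("utf-8", errors="replace")

# Encoding the text again does not bring the original bytes back.
round_tripped = as_text.encode("utf-8")

print(jpeg_header)
print(round_tripped)
print(round_tripped == jpeg_header)
```

The original byte values are gone once they have been replaced during decoding, so no later encode call can recover them. Reading response.content skips the decoding step entirely.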
Or you can just save the data straight to disk without looking at what's inside:
import requests
response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
with open('1.jpg', 'wb') as f:
    f.write(response.content)
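If you want a quick way to find out whether the bytes you got are really a JPEG before writing them, one option is to look at the first few bytes. JPEG files start with the bytes FF D8 FF. The helper name looks_like_jpeg below is made up for this sketch and is not part of requests or Pillow, and the check only inspects the start of the data, so it is a sanity check rather than full validation.

```python
def looks_like_jpeg(data):
    """Return True if data starts with the JPEG start-of-image signature."""
    # Hypothetical helper: only the first three bytes are checked.
    return data[:3] == b"\xff\xd8\xff"


# Example inputs made up for illustration.
good = b"\xff\xd8\xff\xe0\x00\x10JFIF"
bad = b"<html><body>Access denied</body></html>"

print(looks_like_jpeg(good))
print(looks_like_jpeg(bad))
```

You could call looks_like_jpeg(response.content) before opening the file. If it returns False, the server may have sent something other than the image (an HTML error page, for instance), and checking response.status_code would be a reasonable next step.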