python web-scraping python-requests-html

Python Requests HTML - img src gets scraped with data:image/gif;base64

I tried to scrape product images with requests-html (I cannot use BeautifulSoup because the page loads its content dynamically using JavaScript).

I found and extracted the image src attribute from the product page with the following:

images = r.html.find('img.product-media-gallery__item-image')
for image in images:
    print(image.attrs["src"])

But the output consists of data:image/gif;base64 strings instead of image URLs. I already tried replacing the string of the small image with an empty string, but then nothing gets scraped at all from the image source.

What can I do to remove the pixel-sized images and keep only the useful product image URLs?

Solution

Those pixel-sized images are placeholders for the actual images. As you said, the real images are loaded dynamically using JavaScript, so the src attributes you scraped do not hold the real links. The page does, however, embed the image links as JSON inside its HTML source, so you can get them by parsing the HTML and extracting that JSON.

Start by downloading your page HTML:

from requests import get

html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text

You can use a regular expression to extract the image JSON from the data-component attribute in the HTML source, then unescape the HTML-encoded characters:

import re
from html import unescape

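# Raw string avoids the invalid "\s" escape warning; [^"]* stops at the attribute's closing quote
match = re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"', html_data)
decoded_html = unescape(match.group(1))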
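To see what the extraction and unescaping steps do without hitting the network, here is a minimal sketch that runs them on a made-up HTML snippet. The sample markup, the example.com URL, and the "name" key are assumptions for illustration only; the real page's attribute will be much longer and may be structured differently.

```python
import re
from html import unescape

# Hypothetical snippet standing in for the real page source.
sample_html = (
    '<div class="product-media-gallery js-media-gallery"\n'
    '     data-component="[{&quot;name&quot;:&quot;gallery&quot;,'
    '&quot;options&quot;:{&quot;images&quot;:[&quot;https://example.com/a.jpg&quot;]}}]">'
)

# Capture everything inside the data-component attribute's quotes.
pattern = r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"'
match = re.search(pattern, sample_html)

raw_attribute = match.group(1)      # still contains &quot; entities
decoded = unescape(raw_attribute)   # entities turned back into real quotes

print(decoded)
```

Because the attribute value is HTML-escaped, it cannot contain a literal double quote, which is why stopping at the first `"` is safe in this sketch.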

You can now load the JSON into a Python object (here, a list of dictionaries) like so:

from json import loads

json_data = loads(decoded_html)

Then simply traverse down the JSON until you find your list of image links:

images = json_data[3]["options"]["images"]

print(images)
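The hard-coded index 3 is specific to this page and may shift if the site changes its markup. As a rough sketch, assuming the decoded JSON is a list of dictionaries where one entry has an "options" dictionary with an "images" key, you could search for that entry instead. The helper name find_images and the sample data below are hypothetical.

```python
def find_images(components):
    """Return the first "images" list found under a component's "options"."""
    for component in components:
        if not isinstance(component, dict):
            continue
        options = component.get("options")
        if isinstance(options, dict) and "images" in options:
            return options["images"]
    return []  # nothing matched the assumed shape


# Hypothetical data mimicking the assumed shape of the decoded JSON.
sample = [
    {"name": "a"},
    {"name": "b", "options": {"autoplay": False}},
    {"name": "gallery", "options": {"images": ["https://example.com/1.jpg"]}},
]

images = find_images(sample)
print(images)
```

On the real page you would call find_images(json_data) in place of the indexed lookup.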

Putting it all together, the script looks like this:

from requests import get
import re
from html import unescape
from json import loads

# Download the page
html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text

# Extract the data-component attribute and unescape its HTML entities
match = re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"', html_data)
decoded_html = unescape(match.group(1))

# Parse the JSON (the top level is a list)
json_data = loads(decoded_html)

# Get the image list
images = json_data[3]["options"]["images"]

print(images)
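As for the filtering the question asked about: if a list of scraped src values mixes placeholders with real URLs, a small filter can drop the inline data: URIs. This is only a sketch; the helper name drop_placeholders and the sample values are made up, and on a page where every src is a placeholder the result will simply be empty, which matches what the question describes.

```python
def drop_placeholders(srcs):
    """Keep only src values that are not inline data: URIs."""
    return [src for src in srcs if not src.strip().lower().startswith("data:")]


# Hypothetical src values, one placeholder and one real URL.
sample_srcs = [
    "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
    "https://example.com/product.jpg",
]

real_urls = drop_placeholders(sample_srcs)
print(real_urls)
```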