Python Requests HTML - img src gets scraped with data:image/gif;base64


I tried to scrape product images with requests-html (I cannot use BeautifulSoup because the page loads its content dynamically with JavaScript).

I found and extracted the image src attribute from the product page with the following:

images = r.html.find('img.product-media-gallery__item-image')
for image in images:
    print(image.attrs["src"])

But the output consists of data:image/gif;base64 placeholder strings instead of image URLs. I already tried replacing the placeholder string with a blank string, but then nothing at all gets scraped from the image source.

What can I do to remove the pixel-size images and only keep the useful product image URL?


Solution

  • Those pixel-sized images are placeholders for the actual images. As you said, the images are loaded dynamically with JavaScript, so the real links are not in the src attributes. They are, however, embedded in the page source as JSON, so you can get them by parsing the HTML and extracting that JSON.

    Start by downloading your page HTML:

    from requests import get
    
    html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text
    

    You can use a regular expression to extract the image JSON data from the HTML source code, then unescape the HTML-encoded characters:

    import re
    from html import unescape
    
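    # Raw string avoids the invalid "\s" escape; [^"]* stops at the attribute's closing quote
    match = re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"', html_data)
    decoded_html = unescape(match.group(1))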
    

    You can now parse the JSON into Python objects (here, a list of dictionaries) like so:

    from json import loads
    
    json_data = loads(decoded_html)
    

    Then simply traverse down the JSON until you find your list of image links:

    images = json_data[3]["options"]["images"]
    
    print(images)
    

    Putting it all together, the script looks like this:

    from requests import get
    import re
    from html import unescape
    from json import loads
    
    # Download the page
    html_data = get("https://www.coolblue.nl/product/858330/sony-kd-65xh9505-2020.html").text
    
    # Extract the gallery's data-component attribute and unescape the HTML entities
    match = re.search(r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"', html_data)
    decoded_html = unescape(match.group(1))
    
    # Parse the JSON (the top level is a list of dictionaries)
    json_data = loads(decoded_html)
    
    # Get the image list
    images = json_data[3]["options"]["images"]
    
    print(images)
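
The fixed index 3 in the script depends on how this particular page orders its components, so it may stop working if the markup changes. As a rough sketch only, the same extraction could be wrapped in a function that scans the parsed list for the first entry carrying an "options"/"images" pair. The function name extract_image_list and the sample_html snippet are made up for illustration, and the assumption that the top level is a list of dictionaries is inferred from the indexing above rather than checked against the live page:

```python
import re
from html import unescape
from json import loads

# Same attribute the answer targets; [^"]* stops at the closing quote because
# quotes inside the value are HTML-escaped as &quot;.
GALLERY_PATTERN = re.compile(
    r'<div class="product-media-gallery js-media-gallery"\s*data-component="([^"]*)"'
)


def extract_image_list(html_data):
    """Return the gallery's image list, or [] if it cannot be found."""
    match = GALLERY_PATTERN.search(html_data)
    if match is None:
        return []
    components = loads(unescape(match.group(1)))
    # Assumption: the top level is a list of dictionaries, one of which has
    # options.images. Scan for it instead of relying on a fixed position.
    for component in components:
        if not isinstance(component, dict):
            continue
        options = component.get("options")
        if isinstance(options, dict) and "images" in options:
            return options["images"]
    return []


# Hypothetical, heavily simplified stand-in for the real page source.
sample_html = (
    '<div class="product-media-gallery js-media-gallery" '
    'data-component="[{&quot;name&quot;: &quot;tracking&quot;}, '
    '{&quot;name&quot;: &quot;gallery&quot;, &quot;options&quot;: '
    '{&quot;images&quot;: [&quot;https://example.com/a.jpg&quot;]}}]">'
)

print(extract_image_list(sample_html))
print(extract_image_list("<html></html>"))
```

With the real page you would pass html_data from the script above instead of sample_html. Returning an empty list when the gallery element is missing keeps the caller from hitting an AttributeError on a failed regex match.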