Tags: python, web-scraping, scrapy, scrapy-splash

Scrapy-Splash can't find image source URL


I am trying to scrape a product page from ZARA, like this one: https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115

My scrapy-splash container is running. In the Scrapy shell I fetch the page:

fetch('http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115')
2021-05-14 14:30:42 [scrapy.core.engine] INFO: Spider opened
2021-05-14 14:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115> (referer: None)

Everything is working so far, and I am able to get the header and price. However, I also want to get the image URLs of the product.

I try to select them with:

response.css('img.media-image__image::attr(src)').getall()

But the result is this:

['https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png']

Every entry is the same transparent placeholder image, not a real product image. The images display fine in the browser, and I can see them being loaded in the network requests. Is it because they are loaded with AJAX requests? How do I solve this?


Solution

  • @samuelhogg deserves the credit for finding the JSON, but here is an example spider showing how to get all the image URLs from the page. Note that you don't even need Splash here. I haven't tested it with Splash, but I think it should still work.

    from scrapy import Spider
    import json
    
    
    class Zara(Spider):
        name = "zara"
        start_urls = [
            "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115"
        ]
      
        def parse(self, response):
            # Find the JSON-LD block identified by @samuelhogg
            data = response.css("script[type='application/ld+json']::text").get()
            if data is None:
                # No JSON-LD script on this page, so there is nothing to extract
                return
            # Make a set of all the images in the JSON
            images = {image for i in json.loads(data) for image in i["image"]}
            # Do what you want with them!
            print(images)
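
If the JSON layout differs between pages (a single object instead of a list, or "image" given as one string rather than a list), a more defensive version of the same idea might look like the sketch below. It uses only the standard library, so it can be tried outside Scrapy. `JsonLdCollector`, `extract_image_urls` and the example.com URLs are made-up names for illustration. They are not part of Scrapy, and the sample markup is an assumption, not Zara's actual page.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the open block
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_image_urls(html):
    """Return the unique image URLs found in JSON-LD blocks, in page order."""
    collector = JsonLdCollector()
    collector.feed(html)
    urls = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # Accept both a single object and a list of objects
        entries = data if isinstance(data, list) else [data]
        for entry in entries:
            if not isinstance(entry, dict):
                continue
            images = entry.get("image", [])
            if isinstance(images, str):
                images = [images]  # "image" may be one URL instead of a list
            for url in images:
                if isinstance(url, str) and url not in urls:
                    urls.append(url)
    return urls


# Hypothetical page: placeholder <img> tags in the body, real URLs in the JSON-LD
sample_html = """
<html><head>
<script type="application/ld+json">
[{"@type": "Product", "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"]},
 {"@type": "Product", "image": "https://example.com/a.jpg"}]
</script>
</head><body><img class="media-image__image" src="transparent-background.png"></body></html>
"""

print(extract_image_urls(sample_html))
```

Inside a spider you could pass `response.text` to `extract_image_urls` instead of the sample string. A list is used rather than a set so the URLs keep the order in which they appear on the page.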