
Scrapy Splash not respecting Rendering "wait" time


I'm using Scrapy and Splash to scrape this page : https://www.athleteshop.nl/shimano-voor-as-108mm-37184

Here's the image I get in Scrapy Shell with view(response): [image: scrapy shell img]

I need the barcode highlighted in red. It appears to be generated by JavaScript, as can be seen in the source code in Chrome with F12. However, although it is displayed correctly in both Scrapy Shell and Splash localhost, and although Splash localhost gives me the right HTML, the barcode I want to select always equals None with response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first().

The selector isn't the problem, since it works in Chrome's source code. I've been looking for the answer on the web and SO for two days, and no one seems to have the same problem. Is it just that Splash doesn't support it? The settings are the classic ones, as follows:

SPLASH_URL = 'http://192.168.99.100:8050/'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

My code is as follows (the parse method follows the link returned by the website's internal search engine; that part works fine):

    # Spider methods; they need `from scrapy_splash import SplashRequest` at module level.
    def parse(self, response):
        # Follow the first product link returned by the site's search page.
        link = response.xpath("//li[@class='item last']/a/@href").extract_first()
        if link is None:
            self.logger.warning("No product link found on %s", response.url)
            return
        # Render the product page through Splash with the 'wait' argument set to 20.
        yield SplashRequest(response.urljoin(link), self.parse_item,
                            endpoint='render.html', args={'wait': 20})

    def parse_item(self, response):
        product = {}
        product['name'] = response.xpath("//div[@class='product-name']/h1/text()").extract_first()
        product['ean'] = response.xpath("//table[@class='data-table']//tr[@class='even']/td[@class='data last']/text()").extract_first()
        product['price'] = response.xpath("//div[@class='product-shop']//p[@class='special-price']/span[@class='price']/text()").extract_first()
        product['image'] = response.xpath("//div[@class='item image-photo']//img[@class='owl-product-image']/@src").extract_first()
        print(product['name'])
        print(product['ean'])
        print(product['image'])

Printing the name and the image URL works perfectly, since they're not generated by JavaScript. The code looks right, the settings are fine, and Splash localhost shows the expected result, but my selectors work neither when the script runs (it shows no errors) nor in Scrapy Shell.

The problem might be that Scrapy Splash renders instantly, ignoring the wait time (20 seconds!) passed as an argument. What did I do wrong?

Thanks in advance.


Solution

  • It doesn't seem to me that the content of the barcode field is generated dynamically: I can see it in the page source and extract it in Scrapy shell with response.css('.data-table tbody tr:nth-child(2) td:nth-child(2)::text').extract_first().
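
To illustrate why a positional selector like the one above can succeed where the class-based XPath from the question returns None, here is a minimal sketch. Everything in it is an assumption: the markup is a hypothetical fragment, not the real page, and it supposes that the raw HTML carries no "even" or "data last" classes on its rows and cells (for instance because a script adds them later in the browser). It uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT copied from the real page. Assumption: the raw HTML
# has no "even" / "data last" classes on the table rows and cells.
RAW_HTML = """
<div>
  <table class="data-table">
    <tbody>
      <tr><td>SKU</td><td>ABC-123</td></tr>
      <tr><td>EAN</td><td>0000000000000</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)


def first_text(path):
    """Return the text of the first element matching path, or None."""
    node = root.find(path)
    return node.text if node is not None else None


# Class-based path, mirroring the XPath used in the question.
by_class = first_text(
    ".//table[@class='data-table']//tr[@class='even']/td[@class='data last']"
)

# Positional path, mirroring the CSS selector from the answer
# (second row, second cell).
by_position = first_text(".//table[@class='data-table']/tbody/tr[2]/td[2]")

print(by_class)     # None: no element carries those classes in this fragment
print(by_position)  # 0000000000000
```

If this assumption holds for the real page, the value is present without any JavaScript rendering, and the 'wait' argument would not be the cause of the None result.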