Tags: python, python-3.x, web-scraping, scrapy, scrapy-splash

Scrapy-splash response.css() can't get an element


I'm trying to scrape a website with dynamic JS content, and I want to get the breadcrumbs of the current page.

The breadcrumb trail consists of 4 elements, each with the class '.breadcrumbs-link'.

To do so, I wrote this code using scrapy-splash:

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "quotes4"

    start_urls = ["https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 10})
    
    def parse(self, response):
        print('Result:')
        print(len(response.css('.breadcrumbs-link').extract())) # OUTPUT: 0
        print(response.css('.breadcrumbs-link').extract()) # OUTPUT: []

What could be wrong with my approach?


Solution

  • This website (https://www.woolworths.com.au) uses Angular. If you go to the Splash FAQ page, there is a section "Website is not rendered correctly" where we can see:

    non-working localStorage in Private Mode. This is a common issue e.g. for websites based on AngularJS. If rendering doesn’t work, try disabling Private mode (see How do I disable Private mode?).

    And at the linked page we can see:

    How do I disable Private mode?

    With Splash>=2.0, you can disable Private mode (which is “on” by default). There are two ways to go about it:

    at startup, with the --disable-private-mode argument, e.g., if you’re using Docker:

    $ sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
    

    at runtime when using the /execute endpoint and setting splash.private_mode_enabled attribute to false
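
    Either way, the spider has to run inside a project that is already wired up for scrapy-splash. As a reminder, a minimal settings.py sketch following the scrapy-splash README (the SPLASH_URL value is an assumption that Splash listens locally on the default port):

    # settings.py -- standard scrapy-splash configuration from its README
    SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash on the default port

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'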

    The easy way is to disable private mode with --disable-private-mode, but if you don't want to do that, you can pass a Lua script that disables private mode while each request is rendered and re-enables it once the HTML has been captured:

    import scrapy
    from scrapy_splash import SplashRequest
    
    LUA_SCRIPT = """
    function main(splash)
        -- turn private mode off so the site's localStorage works, then render the page
        splash.private_mode_enabled = false
        splash:go(splash.args.url)
        splash:wait(2)
        local html = splash:html()
        -- switch private mode back on before returning the rendered HTML
        splash.private_mode_enabled = true
        return html
    end
    """
    
    class MySpider(scrapy.Spider):
        name = "quotes4"
    
        start_urls = ["https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"]
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url=url,
                                    callback=self.parse,
                                    endpoint='execute',
                                    args={
                                        'wait': 1,
                                        "lua_source":LUA_SCRIPT})
    
        def parse(self, response):
            print('Result:')
            print(".breadcrumbs-link len = %d" % (len(response.css('.breadcrumbs-link').extract()))) # OUTPUT: 4
            print(".breadcrumbs-link = %s" % (response.css('.breadcrumbs-link').extract())) # OUTPUT: [...HTML ELEMENTS...]
    

    Disabling private mode worked for me, with this result:

    Result:
    .breadcrumbs-link len = 4
    .breadcrumbs-link = ['<li class="breadcrumbs-link" ng-repeat="link ....
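
    Once the page renders with private mode disabled, you will probably want the breadcrumb text rather than the raw <li> elements. A minimal sketch of an adjusted parse() for that, assuming each '.breadcrumbs-link' item wraps its label in an <a> tag (adjust the selector if the markup differs); getall() is simply the newer name for extract():

    def parse(self, response):
        # Extract only the visible text of each breadcrumb link
        # (assumes an <a> inside every '.breadcrumbs-link' <li>).
        breadcrumbs = response.css('.breadcrumbs-link a::text').getall()
        yield {'breadcrumbs': breadcrumbs}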