
Scrapy Shell and Scrapy Splash


We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

To use Splash in a spider, we configure several required project settings and yield a Request with the appropriate meta arguments:

from scrapy import Request
import scrapy_splash

# inside a spider callback:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
    }
})
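
For context, the "required project settings" mentioned above usually look roughly like the following in settings.py. This is a sketch based on the scrapy-splash README, not the asker's actual configuration: the SPLASH_URL value assumes Splash is reachable on localhost:8050, and the middleware priorities are the README's suggested values, so check them against the version you have installed.

```python
# settings.py (sketch; values are assumptions taken from the scrapy-splash README)

# Address of the running Splash instance (assumed: local Docker container).
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that avoids re-sending duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dupefilter that takes Splash arguments into account.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```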

This works as documented. But how can we use scrapy-splash inside the Scrapy shell?


Solution

  • Wrap the URL you want to open in the shell in a request to the Splash HTTP API.

    So you would want something like:

    scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5'
    

    where:

    • localhost:8050 is the host and port where your Splash service is running
    • url is the URL you want to crawl; don't forget to URL-encode it, especially if it has a query string of its own
    • render.html is one of the available HTTP API endpoints; in this case it returns the rendered HTML page
    • timeout is the rendering timeout, in seconds
    • wait is the time, in seconds, to wait for JavaScript to execute before the HTML is read and returned
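
Since the target URL has to be URL-encoded, it can be easier to build the shell command programmatically. The following is a minimal sketch using only the Python standard library; the helper name splash_render_url and the example target URL with its own query string are hypothetical, and the Splash address is assumed to be localhost:8050 as in the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      endpoint="render.html", **render_args):
    """Build a Splash HTTP API URL that renders target_url.

    splash_render_url is a hypothetical helper, not part of scrapy-splash.
    urlencode percent-encodes target_url, so its own '?' and '&' characters
    are not mistaken for Splash parameters.
    """
    query = urlencode({"url": target_url, **render_args})
    return f"{splash_base}/{endpoint}?{query}"


# Example target with its own query string (made up for illustration).
shell_url = splash_render_url(
    "http://example.com/page-with-javascript.html?id=1&lang=en",
    timeout=10,
    wait=0.5,
)

# Print a command that can be pasted into a terminal.
print(f"scrapy shell '{shell_url}'")
```

Without the encoding step, the `&lang=en` part of the target URL would be read by Splash as a separate rendering argument rather than as part of the page address.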