python, web-scraping, scrapy, scrapy-splash

Iterating through select items on AJAX page with Scrapy and Splash


I am scraping a page, using both Scrapy and Splash. The page contains a dropdown box (technically, a select HTML element). Each time an element is selected in the dropdown box, a new page is loaded using AJAX.

The HTML segment below is a simplified version of the page I'm processing:

<html>
    <head><title>Title goes here ...</title></head>
    <body>
        <select class="foo">
            <option value="100" data-reactid="1">One</option>
            <option value="200" data-reactid="2">Two</option>
            <!-- ... -->
            <option value="900" data-reactid="9">Nine</option>
        </select>
    </body>
</html>

A snippet of my Scrapy/Splash code:

# Fetch the options ... now what ?
options = response.css("select[class=foo] option[data-reactid]")

How do I programmatically use Splash to 'click' each option and receive the reloaded AJAX page in my response object?


Solution

  • You might try Splash's execute endpoint with a Lua script that sets the select to each option's value in turn and returns the resulting page. Something like:

    ...
    script = """
    function main(splash)
        splash.resource_timeout = 10
        assert(splash:go(splash.args.url))
        splash:wait(1)
        -- Set the select's value, then fire a change event so the
        -- page's AJAX handler actually runs (a plain .value
        -- assignment does not trigger event listeners)
        splash:runjs([[
            var sel = document.getElementsByClassName("foo")[0];
            sel.value = "]] .. splash.args.value .. [[";
            var ev = document.createEvent("HTMLEvents");
            ev.initEvent("change", true, true);
            sel.dispatchEvent(ev);
        ]])
        splash:wait(1)
        return {
            html = splash:html(),
        }
    end
    """
    
    # base_url refers to page with the select
    values = response.xpath('//select[@class="foo"]/option/@value').extract()
    for value in values:
        yield scrapy_splash.SplashRequest(
            base_url, self.parse_result, endpoint='execute',
            args={'lua_source': script, 'value': value, 'timeout': 3600})
    

    Of course, this isn't tested, but you might start there and play with it.
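    In `parse_result` you would then pull data out of the rendered HTML with `response.css()`/`response.xpath()` as usual. As a stdlib-only sketch of what that extraction amounts to, here is how the option values and labels could be recovered from markup like the sample above (the `OptionCollector` class is purely illustrative, not part of Scrapy):

    ```python
    from html.parser import HTMLParser

    class OptionCollector(HTMLParser):
        """Collect {value: label} pairs from <option> elements."""

        def __init__(self):
            super().__init__()
            self._in_option = False
            self._value = None
            self.options = {}

        def handle_starttag(self, tag, attrs):
            if tag == "option":
                self._in_option = True
                self._value = dict(attrs).get("value")

        def handle_data(self, data):
            # Record the option's text while inside an <option> tag
            if self._in_option and self._value is not None:
                self.options[self._value] = data.strip()

        def handle_endtag(self, tag):
            if tag == "option":
                self._in_option = False

    html = """<select class="foo">
    <option value=100 data-reactid=1>One</option>
    <option value=200 data-reactid=2>Two</option>
    </select>"""

    collector = OptionCollector()
    collector.feed(html)
    print(collector.options)  # {'100': 'One', '200': 'Two'}
    ```

    In a real spider you would replace this with a one-liner such as `response.xpath('//option/@value').extract()`, but the parsing logic is the same.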