I am scraping a page using both Scrapy and Splash. The page contains a dropdown box (technically, an HTML select element). Each time an option is selected in the dropdown box, a new page is loaded using AJAX.
The HTML segment below is a simplified version of the page I'm processing:
<html>
<head><title>Title goes here ...</title></head>
<body>
<select class="foo">
<option value=100 data-reactid=1>One</option>
<option value=200 data-reactid=2>Two</option>
<!-- ... -->
<option value=900 data-reactid=9>Nine</option>
</select>
</body>
</html>
# Fetch the options ... now what ?
options = response.css("select[class=foo] option[data-reactid]")
How do I programmatically use Splash to 'click' an option and receive the reloaded AJAX page in my response object?
You might try Splash's execute endpoint with a Lua script that sets the select to each option's value and returns the result. Something like:
...
script = """
function main(splash)
    splash.resource_timeout = 10
    splash:go(splash.args.url)
    splash:wait(1)
    splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
    splash:wait(1)
    return {
        html = splash:html(),
    }
end
"""
# base_url refers to page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
    yield scrapy_splash.SplashRequest(
        base_url, self.parse_result, endpoint='execute',
        args={'lua_source': script, 'value': value, 'timeout': 3600})
Of course, this isn't tested, but you might start there and play with it.
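One thing to watch for: assigning to a select's value from JavaScript does not, by itself, fire a change event, so a page that loads its content from a change handler may not react. The variation below is an equally untested sketch that also dispatches a bubbling change event after setting the value. It uses splash:jsfunc to pass the value as an argument instead of concatenating it into the JavaScript string. The names LUA_SELECT_SCRIPT and build_splash_args, and the default timeout of 90, are made up for illustration, and whether the page's handler responds to a synthetic event is an assumption you would need to check.

```python
# Sketch only: not tested against a real Splash instance.
# LUA_SELECT_SCRIPT and build_splash_args are hypothetical names.
LUA_SELECT_SCRIPT = """
function main(splash)
    splash.resource_timeout = 10
    assert(splash:go(splash.args.url))
    splash:wait(1)
    -- splash:jsfunc wraps a JavaScript function so Lua can call it
    local choose = splash:jsfunc([[
        function (value) {
            var select = document.getElementsByClassName("foo")[0];
            select.value = value;
            // Fire a bubbling "change" event so page handlers can react
            var event = document.createEvent("HTMLEvents");
            event.initEvent("change", true, true);
            select.dispatchEvent(event);
        }
    ]])
    choose(splash.args.value)
    splash:wait(1)
    return {html = splash:html()}
end
"""


def build_splash_args(values, timeout=90):
    """Return one args dict per option value for the 'execute' endpoint."""
    return [
        {"lua_source": LUA_SELECT_SCRIPT, "value": value, "timeout": timeout}
        for value in values
    ]
```

Each dict can then be passed as args to scrapy_splash.SplashRequest with endpoint='execute', the same way as in the loop above. If the content takes longer than a second to arrive, the second splash:wait would need to be longer.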