python web-scraping scrapy splash-screen scrapy-splash

Click display button in Scrapy-Splash


I am scraping http://www.starcitygames.com/buylist/ with scrapy-splash. The page requires a login, and that part works fine. To get the data I need, however, I have to click the Display button; the data is not accessible until the button is clicked.

A previous answer told me that I cannot simply click the Display button and scrape what shows up, and that I should scrape the JSON page associated with that data instead. I am concerned that requesting the JSON directly will be a red flag to the site's owners: most people never open the JSON page, and it would take a human several minutes to find it, whereas a computer gets there much faster.

So my question is: is there any way to scrape the page by clicking Display and going from there, or do I have no choice but to scrape the JSON page? This is what I have so far, but it is not clicking the button.

import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': '[email protected]', 'ex_usr_pass': 'password'},
            callback=self.after_login,
        )

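    def after_login(self, response):
        # Look up the href of the "Display>>" link. Following it is a plain
        # HTTP request, not a browser click, so no button is actually pressed.
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()
        if display_button:
            yield response.follow(display_button, self.parse)

        # Yield the item; a value returned from a generator is never
        # delivered to Scrapy, so the original "return item" dropped it.
        item = NameItem()
        item["Name"] = response.css("div.bl-result-title::text").get()
        yield item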

Snapshot of the website's HTML code


Solution

  • You can use your browser's developer tools to track the request that the click event triggers. The response is in a nice JSON format, and the request needs no cookie (no login):

    http://www.starcitygames.com/buylist/search?search-type=category&id=5061

    The only thing you need to fill in is the category id for the request (the id parameter in the URL); it can be extracted from the HTML and declared in your code.

    Category name:

    //*[@id="bl-category-options"]/option/text()
    

    Category id:

    //*[@id="bl-category-options"]/option/@value
    

    Working with JSON is much simpler than parsing HTML.
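
The approach above can be sketched with the standard library alone: pull the category ids and names out of the select element, then build one search URL per id. This is only an illustration under stated assumptions. SAMPLE_HTML is a made-up, simplified stand-in for the real page (the category names and the id 1234 are invented), the helper names are hypothetical, and the structure of the JSON response is not shown in this thread, so the sketch decodes it without assuming any keys.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the request URL shown in the answer.
SEARCH_URL = "http://www.starcitygames.com/buylist/search"

# Hypothetical, simplified stand-in for the <select> on the buylist page.
# The category names and the id 1234 are invented for illustration.
SAMPLE_HTML = """
<div>
  <select id="bl-category-options">
    <option value="5061">Example category A</option>
    <option value="1234">Example category B</option>
  </select>
</div>
"""


def extract_categories(markup):
    """Return {category_id: category_name} from the category <select>."""
    root = ET.fromstring(markup)
    # Same selection as the XPath expressions in the answer, written in the
    # XPath subset that ElementTree supports.
    options = root.findall(".//*[@id='bl-category-options']/option")
    return {opt.get("value"): (opt.text or "").strip() for opt in options}


def build_search_url(category_id):
    """Build the JSON search URL for a single category id."""
    query = urlencode({"search-type": "category", "id": category_id})
    return f"{SEARCH_URL}?{query}"


def parse_search_response(body):
    """Decode the JSON body; no particular keys are assumed."""
    return json.loads(body)


categories = extract_categories(SAMPLE_HTML)
urls = [build_search_url(category_id) for category_id in categories]
```

Inside a Scrapy spider the same idea would likely use response.xpath(...) with the two expressions from the answer, yield a scrapy.Request for each built URL, and call json.loads(response.text) in the callback. ElementTree only handles well-formed markup, so it stands in here for Scrapy's selectors purely to keep the sketch dependency-free.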