Tags: python, ajax, screen-scraping, scrapy, web-crawler

Parsing AJAX responses to retrieve final URL content in Scrapy?


I have the following problem:

My scraper starts at a "base" URL. This page contains a dropdown that populates another dropdown via AJAX calls, and this cascades 2-3 times until there is enough information to reach the "final" page, where the actual content I want to scrape lives.

Rather than clicking things (and having to use Selenium or similar), I use the page's exposed JSON API to mimic this behavior: instead of clicking dropdowns, I simply send a request and read JSON responses that contain the array of information used to generate the next dropdown's contents, and repeat this until I have the final URL for one item. This URL takes me to the final item page that I want to actually parse.
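
For reference, each step of that chain is just a plain GET request; this is roughly what it looked like in my old urllib version (a sketch assuming Python 3; the URL here is a placeholder for the real endpoint shown in the example below):

import urllib.request

# placeholder endpoint; the location/section parameters mirror the example below
url = "http://example.com/?location=120&section=240"
body = urllib.request.urlopen(url).read()
# body now holds the response that would otherwise populate the next dropdown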

I am confused about how to use Scrapy to get the "final" URL for every combination of dropdown boxes. I wrote a crawler using urllib that used a ton of loops to just iterate through every URL combination, but Scrapy seems to work a bit differently. I moved away from urllib and lxml because Scrapy seemed like a more maintainable solution that is easier to integrate with Django projects.

Essentially, I am trying to force Scrapy to take a certain path that I generate along the way as I read the contents of the JSON responses, and to only really parse the last page in the chain to get the real content. It needs to do this for every possible page, and I would love to parallelize it so things are efficient (and use Tor, but these are later issues).

I hope I have explained this well, let me know if you have any questions. Thank you so much for your help!

Edit: Added an example

[base url]/?location=120&section=240

returns:

<departments>
<department id="62" abrev="SIG" name="name 1"/>
<department id="63" abrev="ENH" name="name 2"/>
<department id="64" abrev="GGTH" name="name 3"/>
...[more]
</departments>

Then I grab the department id and add it to the URL like so:

[base url]/?location=120&section=240&department_id=62

returns:

<courses>
<course id="1" name="name 1"/>
<course id="2" name="name 2"/>
</courses>

This continues until I end up with the actual link to the listing.
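
Pulling the ids out of each response to build the next URL is straightforward with lxml (a sketch; the XPath assumes the <departments> response above, and the same pattern applies to <courses>):

from lxml import etree

def next_ids(xml_bytes):
    # return the id attribute of every <department> element
    root = etree.fromstring(xml_bytes)
    return root.xpath("//department/@id")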

This is essentially what it looks like on the page (though in my case, there is a final "submit" button on the form that sends me to the actual listing that I want to parse): http://roshanbh.com.np/dropdown/

So, I need some way of scraping every combination of the dropdowns so that I get all the possible listing pages. The intermediate step of walking the AJAX XML responses to generate the final listing URLs is what's messing me up.


Solution

  • You can use a chain of callback functions, starting from the main parse callback. Say you're implementing a spider extending BaseSpider; write your parse function like this:

    from scrapy.http import Request

    ...
    
    def parse(self, response):
      # other code
      yield Request(url=self.base_url, callback=self.first_dropdown)
    
    def first_dropdown(self, response):
      ids = self.parse_first_response(response)   # code for parsing the first dropdown's contents
      for i in ids:
        req_url = response.url + "?location=" + i
        yield Request(url=req_url, callback=self.second_dropdown)
    
    def second_dropdown(self, response):
      ids = self.parse_second_response(response)   # code for parsing the second dropdown's contents
      for i in ids:
        req_url = response.url + "&section=" + i
        yield Request(url=req_url, callback=self.third_dropdown)
    
    ...
    

    The last callback function will contain the code needed to extract your data.
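
    For illustration, with a current Scrapy version that final callback could look something like this (the field names and XPaths are placeholders, not taken from the question's site):

    def parse_listing(self, response):
      # extract whatever fields the listing page exposes;
      # these XPaths are placeholder examples
      yield {
        "title": response.xpath("//h1/text()").extract_first(),
        "url": response.url,
      }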

    Be careful: you're asking to try every possible combination of inputs, and that can lead to a very high number of requests very quickly.
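
    Scrapy already runs requests in parallel out of the box, which covers the efficiency point from the question; if the volume gets out of hand, the built-in settings can throttle the crawl (a sketch, tune the values to the target site):

    # settings.py (excerpt)
    CONCURRENT_REQUESTS = 8   # cap how many requests run in parallel
    DOWNLOAD_DELAY = 0.5      # seconds between requests to the same site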