
Using Scrapy's FormRequest.from_response method to automate scraping of data selected through a dropdown menu


I have been struggling with this for the past two days. I need to scrape data from this website for all "Cadres" (categories). Unfortunately, the website exposes this data only through a "Select Cadre" dropdown menu, which has no "All Categories" option. To work around this, I am using Scrapy's FormRequest.from_response method, but the spider produces an empty file with no data in it. Any help is appreciated. Here's the code:

import scrapy

class IASWinnerSpider(scrapy.Spider):

    name = 'iaswinner_list'
    allowed_domains = ['http://civillist.ias.nic.in']

    def start_requests(self):
        urls = [ 'http://civillist.ias.nic.in/UpdateCL/DraftCL.asp' ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return scrapy.FormRequest.from_response(response, method='POST',
                    formdata={'cboCadre': 'UT'}, dont_click=True, callback=self.after_post)

    def after_post(self, response):

        table      = response.xpath('/html/body/div/table//tr')

        for t in table:

            yield {
                'serial': t.xpath('td[1]/text()').extract(),
                'name': t.xpath('td[2]/text()').extract(),
                'qual': t.xpath('td[3]/text()').extract(),
                'dob': t.xpath('td[4]/text()').extract(),
                'post': t.xpath('td[5]/text()').extract(),
                'rem': t.xpath('td[6]/text()').extract(),
            }

Solution

  • When I run your code, I see this in the log:

    2017-08-19 15:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'civillist.ias.nic.in': <POST http://civillist.ias.nic.in/UpdateCL/DraftCL.asp>
    

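    The POST request was filtered as offsite because allowed_domains expects bare domain names, not URLs. Just drop the scheme and change allowed_domains to this: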

    allowed_domains = ['civillist.ias.nic.in']
    

    and it works.
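
To see why the entry with the scheme fails, here is a minimal sketch of a domain check using only the standard library. It is a simplified stand-in, not Scrapy's actual offsite code, and the helper names to_domain and is_onsite are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Return the bare host name for an allowed_domains entry.

    Hypothetical helper: accepts either a bare domain or a full URL.
    """
    # urlparse only fills in hostname when a scheme (or a leading //) is present
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.hostname


def is_onsite(url, allowed_domains):
    """Simplified illustration of an offsite check (assumption, not Scrapy's code)."""
    host = urlparse(url).hostname or ""
    # A request is onsite if its host equals an allowed domain or is a subdomain of it
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "http://civillist.ias.nic.in/UpdateCL/DraftCL.asp"
print(is_onsite(url, ["http://civillist.ias.nic.in"]))             # False
print(is_onsite(url, [to_domain("http://civillist.ias.nic.in")]))  # True
```

The host of the request URL is civillist.ias.nic.in, which can never equal a string that starts with http://, so every request the spider yields looks offsite until the scheme is removed.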