Tags: python, web-scraping, screen-scraping, scrapy

Avoiding overlapping of responses in python scrapy


I was trying to scrape the family information of Indian Rajya Sabha members found here: http://164.100.47.5/Newmembers/memberlist.aspx
Being a newbie to Scrapy, I followed a couple of example codes to produce the following.

def parse(self, response):

    # assumes module-level: import re, import scrapy, and RsItem from items.py
    print "Inside parse"
    requests = []
    target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
    target_base_suffix = '$lkb'

    for i in range(2, 5):
        # zero-pad single-digit ids: 2 -> "02", 10 stays "10"
        target_id = str(i).zfill(2)

        evTarget = target_base_prefix+target_id+target_base_suffix

        form_data = {'__EVENTTARGET':evTarget,'__EVENTARGUMENT':''}

        requests.append(scrapy.http.FormRequest.from_response(
            response, formdata=form_data, dont_filter=True,
            method='POST', callback=self.parse_politician))

    for r in requests:
        print "before yield " + str(r)
        yield r


def parse_pol_bio(self,response):

    print "[parse_pol_bio] - response url - " + response.url

    name_xp = '//span[@id="ctl00_ContentPlaceHolder1_GridView1_ctl02_Label3"]/font/text()'
    base_xp_prefix = '//*[@id="ctl00_ContentPlaceHolder1_TabContainer1_TabPanel2_ctl00_DetailsView2_Label'
    base_xp_suffix = '"]/text()'
    father_id = '12'
    mother_id = '13'
    married_id = '1'
    spouse_id = '3'

    name = response.xpath(name_xp).extract()[0].strip()
    name = re.sub(' +', ' ', name)

    father = response.xpath(base_xp_prefix + father_id + base_xp_suffix).extract()[0].strip()
    mother = response.xpath(base_xp_prefix + mother_id + base_xp_suffix).extract()[0].strip()
    married = response.xpath(base_xp_prefix + married_id + base_xp_suffix).extract()[0].strip().split(' ')[0]

    if married == "Married":
        spouse = response.xpath(base_xp_prefix + spouse_id + base_xp_suffix).extract()[0].strip()
    else:
        spouse = ''

    print 'name     marital_stat    father_name     mother_name     spouse'
    print name,married,father,mother,spouse

    item = RsItem()
    item['name'] = name
    item['spouse'] = spouse
    item['mother'] = mother
    item['father'] = father

    return item



def parse_politician(self,response):

    evTarget = 'ctl00$ContentPlaceHolder1$TabContainer1'
    evArg = 'activeTabChanged:1'
    formdata = {'__EVENTTARGET': evTarget, '__EVENTARGUMENT': evArg}

    print "[parse_politician] - response url - " + response.url

    # formdata must be passed as a keyword argument; passed positionally it
    # would bind to from_response's formname parameter instead
    return scrapy.FormRequest.from_response(
        response, formdata=formdata, method='POST',
        callback=self.parse_pol_bio)

Explanation
The parse method loops over the target ids for the different politicians and yields a request for each.
parse_politician handles the tab change.
parse_pol_bio scrapes the family members' names.

Problem
The problem is that duplicate responses reach parse_pol_bio, i.e. information about the same person comes through multiple times.
Which politician's data gets duplicated is random, differing from run to run.
I have already checked whether any request is being yielded multiple times, but none is.
I also tried putting some sleep after each yielded request to see if it helps.
I suspect the Scrapy request scheduler here.

Is there any other problem in the code? Can anything be done to avoid this?

EDIT
Just to clarify: I know what dont_filter=True does and have deliberately kept it.

The problem is that some response data are getting replaced. For example, when I generate three requests, for target_id = 1, 2 and 3 separately, the response for target_id = 1 gets replaced by a response for target_id = 2.
[That leaves me with one response for target_id = 3 and two for target_id = 2.]

Expected output (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol2 , spouse2, father2, mother2
pol3 , spouse3, father3, mother3

Output given (csv)

politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol1 , spouse1, father1, mother1
pol3 , spouse3, father3, mother3

Solution

  • Finally fixed it (phew!).
    By default Scrapy sends 16 requests at a time (CONCURRENT_REQUESTS = 16).
    Putting CONCURRENT_REQUESTS = 1 in settings.py made the requests sequential, and that solved the issue.

    The requests I yielded were nearly identical (see above), and their response data overlapped with one another, producing duplicates of one politician only.

    No idea how that was happening, but the fix of making the requests sequential confirms it.
    Any better explanations?
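A minimal sketch of the fix, assuming the standard Scrapy project layout (settings.py is the file Scrapy's default project template generates):

```python
# settings.py -- limit Scrapy to one in-flight request at a time, so each
# ASP.NET postback is issued only after the previous response has arrived.
CONCURRENT_REQUESTS = 1
```

This trades crawl speed for sequential, deterministic postbacks; the same setting can also be applied per-spider via the custom_settings class attribute instead of globally.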