I was trying to scrape the family information of Indian Rajya Sabha members, found here: http://164.100.47.5/Newmembers/memberlist.aspx
Being new to Scrapy, I followed this and this example code to produce the following.
    # (methods of the Spider class; requires `import scrapy` and `import re` at module level)
    def parse(self, response):
        print "Inside parse"
        requests = []
        target_base_prefix = 'ctl00$ContentPlaceHolder1$GridView2$ctl'
        target_base_suffix = '$lkb'
        for i in range(2, 5):
            # GridView row ids are zero-padded to two digits (ctl02, ctl03, ...)
            target_id = str(i).zfill(2)
            evTarget = target_base_prefix + target_id + target_base_suffix
            form_data = {'__EVENTTARGET': evTarget, '__EVENTARGUMENT': ''}
            requests.append(scrapy.http.FormRequest.from_response(
                response, formdata=form_data, dont_filter=True,
                method='POST', callback=self.parse_politician))
        for r in requests:
            print "before yield " + str(r)
            yield r
    def parse_pol_bio(self, response):
        print "[parse_pol_bio] - response url - " + response.url
        name_xp = '//span[@id="ctl00_ContentPlaceHolder1_GridView1_ctl02_Label3"]/font/text()'
        base_xp_prefix = '//*[@id="ctl00_ContentPlaceHolder1_TabContainer1_TabPanel2_ctl00_DetailsView2_Label'
        base_xp_suffix = '"]/text()'
        father_id = '12'
        mother_id = '13'
        married_id = '1'
        spouse_id = '3'

        name = response.xpath(name_xp).extract()[0].strip()
        name = re.sub(' +', ' ', name)
        father = response.xpath(base_xp_prefix + father_id + base_xp_suffix).extract()[0].strip()
        mother = response.xpath(base_xp_prefix + mother_id + base_xp_suffix).extract()[0].strip()
        married = response.xpath(base_xp_prefix + married_id + base_xp_suffix).extract()[0].strip().split(' ')[0]
        if married == "Married":
            spouse = response.xpath(base_xp_prefix + spouse_id + base_xp_suffix).extract()[0].strip()
        else:
            spouse = ''

        print 'name marital_stat father_name mother_name spouse'
        print name, married, father, mother, spouse

        item = RsItem()
        item['name'] = name
        item['spouse'] = spouse
        item['mother'] = mother
        item['father'] = father
        return item
    def parse_politician(self, response):
        evTarget = 'ctl00$ContentPlaceHolder1$TabContainer1'
        evArg = 'activeTabChanged:1'
        formdata = {'__EVENTTARGET': evTarget, '__EVENTARGUMENT': evArg}
        print "[parse_politician] - response url - " + response.url
        # formdata must be passed as a keyword argument; passed positionally
        # it would bind to the formname parameter of from_response
        return scrapy.FormRequest.from_response(response, formdata=formdata,
                                                method='POST', callback=self.parse_pol_bio)
Explanation
The parse method loops over the target ids for the different politicians and yields one POST request per member.
parse_politician switches the active tab (so the biography panel is rendered).
parse_pol_bio scrapes the family-member names.
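The __EVENTTARGET construction in parse can be sketched as a small standalone helper (the helper name build_event_targets is mine, not part of the original spider; it only reproduces the zero-padding logic described above):

```python
def build_event_targets(ids,
                        prefix='ctl00$ContentPlaceHolder1$GridView2$ctl',
                        suffix='$lkb'):
    """Map each numeric GridView row id to its zero-padded
    ASP.NET __EVENTTARGET postback value (ctl02, ctl03, ...)."""
    return {i: prefix + str(i).zfill(2) + suffix for i in ids}

# The loop in parse covers rows 2-4:
targets = build_event_targets(range(2, 5))
```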
Problem
The problem is that this produces duplicate responses in parse_pol_bio, i.e. information about the same person comes through multiple times.
Which responses get duplicated is random on each run, i.e. a different politician's data may be duplicated every time.
I have already checked whether any request is being yielded multiple times; none is.
I also tried sleeping after each yielded request to see if that helps.
I suspect the Scrapy request scheduler here.
Is there any other problem in the code? Can anything be done to avoid this?
EDIT
Just to clarify: I know what dont_filter=True does and have kept it deliberately.
The problem is that some response data gets replaced. For example, when I generate three requests, one each for target_id = 1, 2, 3, the response for target_id = 1 is replaced by a second response for target_id = 2.
[So I end up with one response for target_id = 3 and two for target_id = 2.]
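One way to confirm which request actually produced each response (a debugging sketch, not part of the original spider; the target_id meta key and the tally below are my additions) is to create each FormRequest with meta={'target_id': i} in parse, read response.meta['target_id'] back in parse_pol_bio, and count how often each id is answered:

```python
from collections import Counter

# Tally of the target_id carried by each parsed response, as it could be
# collected in parse_pol_bio via response.meta['target_id'].
seen = Counter()

def record_response(target_id):
    """Record one parsed response; return True if this id was already seen."""
    seen[target_id] += 1
    return seen[target_id] > 1

# Simulating the failure described above: target 2 answers twice, target 1 never.
observed = [2, 2, 3]
duplicates = [t for t in observed if record_response(t)]
```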
Expected output (csv)
politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol2 , spouse2, father2, mother2
pol3 , spouse3, father3, mother3
Output given (csv)
politician name , spouse name , father name , mother name
pol1 , spouse1, father1, mother1
pol1 , spouse1, father1, mother1
pol3 , spouse3, father3, mother3
Finally fixed it (phew!).
By default Scrapy keeps 16 requests in flight at a time (concurrent requests).
Setting CONCURRENT_REQUESTS = 1 in settings.py makes the requests sequential, and that solved the issue.
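For reference, the change amounts to one line in settings.py (a sketch; the same setting can also, as far as I know, be scoped to a single spider via the custom_settings class attribute):

```python
# settings.py
# Throttle Scrapy from its default of 16 concurrent requests down to 1,
# so the form POSTs are processed strictly one after another.
CONCURRENT_REQUESTS = 1
```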
The requests I generated were similar (see above), and the response data overlapped with one another, producing duplicates of one member only.
I have no idea how that was happening, but the fix of making the requests sequential is consistent with it.
Any better explanations?