My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in my understanding of the logic between logging in and how the response is carried through to subsequent requests.
According to the documentation, a request produces a response object, and this response object is passed as the first argument to the request's callback function. This I get: this is how authentication can be handled and subsequent requests made using the user's credentials.
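For example, as I understand it, a request like this (the URL and callback name here are invented just for illustration):

    yield Request('https://www.webpage.org/some/page/', callback=self.parse_page)

tells Scrapy to download the page and then call self.parse_page(response) with the resulting response object.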
What I am not understanding is how the response object makes it to the next request call after authentication. In my code below, the parse method returns a FormRequest that performs the authentication. Since that FormRequest names the after_login method as its callback, after_login is called with the FormRequest's response as its first parameter.
The after_login method checks that the login succeeded, then makes further requests through a yield statement. What I do not understand is how the response passed in as an argument to after_login makes it to the Request following the yield. How does this happen?
The primary reason I am interested is that I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are handled by the scraper, so I do not know how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response,
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:
            # log error
            self.log('Login failed', level=log.ERROR)
            # exit method
            return
        else:
            # for every integer from 1 to 5000 (1100 to 1110 here for testing)...
            for reach_id in xrange(1100, 1110):
                # make the request, using format to build a four-digit id string
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)
        # create item object instance
        reach_item = ReachItem()
        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()
        # other selectors...
        # return the reach item
        return reach_item
"how the response passed in as an argument to the after_login method is making it to the Request following the yield"

If I understand your question correctly, the answer is that it doesn't.
The mechanism is simple:

    for x in spider.function():
        if x is a Request:
            make the HTTP call and wait for the response asynchronously
        if x is an Item:
            send it to the pipelines, etc.

    upon getting a response:
        request.callback(response)
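To make that concrete, here is a toy, synchronous sketch of the loop above. It is not Scrapy's real engine (which is asynchronous, built on Twisted), and toy_engine, download, and process_item are invented names, but the callback wiring is the same idea:

    from collections import deque
    from scrapy.http import Request

    def toy_engine(start_requests, download, process_item):
        pending = deque(start_requests)
        while pending:
            request = pending.popleft()
            # stand-in for the real asynchronous downloader
            response = download(request)
            # hand the response to the callback attached to this request
            for result in request.callback(response) or []:
                if isinstance(result, Request):
                    # a yielded Request goes back into the queue,
                    # carrying its own callback with it
                    pending.append(result)
                else:
                    # anything else is an item for the pipelines
                    process_item(result)

Note that the response given to a callback never travels on to the next Request; each yielded Request carries its own callback, and each downloaded response is handed only to that callback.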
As you can see, there is no limit to the number of requests a callback can yield, so you can:
    for reach_id in xrange(x, y):
        yield Request(url=url1, callback=callback1)
        yield Request(url=url2, callback=callback2)
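Applied to your spider, after_login could yield both requests for each reach_id. The second URL and the scrape_history callback below are invented for illustration, so substitute the real page and parser you need:

    def after_login(self, response):
        if 'Login failed' in response.body:
            self.log('Login failed', level=log.ERROR)
            return
        for reach_id in xrange(1100, 1110):
            # first request for this reach id
            yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                          callback=self.scrape_page)
            # second, hypothetical request for the same id;
            # its response will arrive at self.scrape_history, not here
            yield Request('https://www.webpage.org/content/River/history/id/{0:0>4}/'.format(reach_id),
                          callback=self.scrape_history)

Each of the two requests gets its own response, delivered independently to the callback you attached to it.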
Hope this helps.