I want to know the impact of raising CloseSpider. The documentation at http://doc.scrapy.org/en/latest/topics/exceptions.html#closespider says nothing about it. As you know, Scrapy processes several requests at the same time. What happens if this exception is raised before the last request has been handled? Will Scrapy wait to handle the remaining requests that were produced before it? Example:
def parse(self, response):
    base_url = 'http://someurl.com/item/'
    for i in range(1, 100):
        if i == 50:
            raise CloseSpider('')
        yield Request(url=base_url + str(i), callback=self.my_handler)

def my_handler(self, response):
    # handler
    pass
Thanks for your responses.
======================== Possible solution:
is_alive = True  # class attribute; flip it via self.is_alive so the assignment is not a function-local name

def parse(self, response):
    base_url = 'http://url.com/item/'
    for i in range(1, 100):
        if not self.is_alive:
            break
        yield Request(url=base_url + str(i), callback=self.my_handler)

def my_handler(self, response):
    if not self.contains_new_item(response):  # pseudocode: the response contains no new item
        self.is_alive = False
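One pitfall with this pattern: assigning `is_alive = False` inside a method rebinds a local name, so the flag should live on the spider instance (`self.is_alive`). The pattern can be sketched without Scrapy at all; the class and names below are illustrative stand-ins, and keep in mind that a real Scrapy callback runs asynchronously, not inline in the loop as it does in this synchronous simulation:

```python
class StopFlagSpider:
    """Scrapy-free sketch of the stop-flag pattern above.

    The flag lives on the instance, so the handler can flip it
    without a ``global`` statement.
    """

    def __init__(self):
        self.is_alive = True

    def parse(self):
        # Yield URLs until the handler flips the flag (mirrors the
        # ``if not is_alive: break`` check in the solution above).
        base_url = 'http://url.com/item/'
        for i in range(1, 100):
            if not self.is_alive:
                break
            yield base_url + str(i)

    def my_handler(self, response_has_item):
        # Stand-in for "the response does not contain a new item".
        if not response_has_item:
            self.is_alive = False


spider = StopFlagSpider()
seen = []
for url in spider.parse():
    seen.append(url)
    # Pretend only the first three pages contain items.
    spider.my_handler(response_has_item=len(seen) < 3)

print(len(seen))  # → 3
```

Because Scrapy schedules requests concurrently, a few extra requests may already be queued by the time the flag flips; the flag only stops *new* requests from being yielded.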
According to the source code, when a CloseSpider exception is raised, the engine.close_spider() method is executed:
def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
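To see how the reason string travels, the dispatch above can be mimicked with stubs. The CloseSpider and StubEngine classes below are simplified stand-ins for illustration, not Scrapy's real classes, and the boolean return value is added only to make the sketch observable:

```python
class CloseSpider(Exception):
    """Simplified stand-in for scrapy.exceptions.CloseSpider."""

    def __init__(self, reason='cancelled'):
        self.reason = reason


class StubEngine:
    """Records the close reason, mimicking engine.close_spider()."""

    def __init__(self):
        self.closed_with = None

    def close_spider(self, spider, reason='cancelled'):
        self.closed_with = reason


def handle_spider_error(engine, exc, spider):
    # Mirrors the isinstance check from the Scrapy snippet above:
    # a CloseSpider is routed to engine.close_spider() with its reason.
    if isinstance(exc, CloseSpider):
        engine.close_spider(spider, exc.reason or 'cancelled')
        return True
    return False


engine = StubEngine()
handle_spider_error(engine, CloseSpider('no_more_items'), spider=object())
print(engine.closed_with)  # → no_more_items
```

Note that an empty reason, as in the question's `raise CloseSpider('')`, falls back to `'cancelled'` because of the `exc.reason or 'cancelled'` expression.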
engine.close_spider() itself closes the spider and clears all its outstanding requests:
def close_spider(self, spider, reason='cancelled'):
    """Close (cancel) spider and clear all its outstanding requests"""
    slot = self.slot
    if slot.closing:
        return slot.closing
    logger.info("Closing spider (%(reason)s)",
                {'reason': reason},
                extra={'spider': spider})
    dfd = slot.close()
    # ...
It also schedules close_spider() calls for the other components of Scrapy's architecture: the downloader, the scraper, the scheduler, etc.