python · web-scraping · scrapy · scraper

What is the impact of raising CloseSpider in Scrapy?


I want to know what the impact of raising CloseSpider is. The documentation (http://doc.scrapy.org/en/latest/topics/exceptions.html#closespider) says nothing about it. As you know, Scrapy processes several requests at the same time. What if this exception is raised before the last request has been handled? Will Scrapy wait for the requests produced before it to finish? Example:

from scrapy import Request
from scrapy.exceptions import CloseSpider

def parse(self, response):
    my_url = 'http://someurl.com/item/'
    for i in range(1, 100):
        if i == 50:
            raise CloseSpider('enough items')
        yield Request(url=my_url + str(i), callback=self.my_handler)

def my_handler(self, response):
    # handle the response
    pass

Thanks for your responses.
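One detail worth noticing before any answer: the raise happens inside a generator, so every request yielded before it has already been handed to the engine. A minimal pure-Python sketch of that part of the question (the CloseSpider class here is a stand-in, not Scrapy's real one):

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider."""
    def __init__(self, reason='cancelled'):
        super().__init__(reason)
        self.reason = reason

def parse():
    # Mirrors the question's callback: yield one request per item,
    # then raise when i reaches 50.
    for i in range(1, 100):
        if i == 50:
            raise CloseSpider('enough')
        yield 'http://someurl.com/item/' + str(i)

scheduled = []
try:
    for url in parse():
        scheduled.append(url)
except CloseSpider:
    pass

# Exactly the 49 URLs yielded before the raise were ever scheduled.
print(len(scheduled))  # -> 49
```

So the open question is only what happens to those 49 already-scheduled requests, which is what the answer below addresses.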

Possible solution:

is_alive = True  # class attribute; flip it via self.is_alive

def parse(self, response):
    my_url = 'http://url.com/item/'
    for i in range(1, 100):
        if not self.is_alive:
            break
        yield Request(url=my_url + str(i), callback=self.my_handler)

def my_handler(self, response):
    if response_contains_no_new_item:  # pseudocode condition
        self.is_alive = False
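The flag pattern above can be sketched without Scrapy at all. Note this is an idealized sketch: it assumes the handler runs between yields, whereas in real Scrapy the callback may have yielded many requests before any response comes back.

```python
# Pure-Python sketch of the flag pattern: the producer checks a shared
# flag and stops yielding once a consumer flips it.
class Spider:
    def __init__(self):
        self.is_alive = True

    def parse(self):
        for i in range(1, 100):
            if not self.is_alive:
                break
            yield 'http://url.com/item/' + str(i)

    def my_handler(self, response_has_item):
        # Flip the flag once a response no longer contains a new item.
        if not response_has_item:
            self.is_alive = False

spider = Spider()
handled = []
for url in spider.parse():
    handled.append(url)
    # Pretend the 10th response no longer contains a new item.
    spider.my_handler(response_has_item=len(handled) < 10)

print(len(handled))  # -> 10; the generator stopped at the flag check
```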

Solution

  • According to the source code, when a CloseSpider exception is raised, the engine.close_spider() method is executed:

    def handle_spider_error(self, _failure, request, response, spider):
        exc = _failure.value
        if isinstance(exc, CloseSpider):
            self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
            return
    

    engine.close_spider() itself would close the spider and clear all outstanding requests:

    def close_spider(self, spider, reason='cancelled'):
        """Close (cancel) spider and clear all its outstanding requests"""
    
        slot = self.slot
        if slot.closing:
            return slot.closing
        logger.info("Closing spider (%(reason)s)",
                    {'reason': reason},
                    extra={'spider': spider})
    
        dfd = slot.close()
    
        # ...
    

    It also schedules close_spider() calls for the various components of Scrapy's architecture: the downloader, scraper, scheduler, etc.
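To make the consequence concrete, here is a toy simulation (pure Python, not Scrapy's actual engine) of the behaviour described above: once close_spider() runs, the outstanding requests — already scheduled but not yet downloaded — are cleared and never reach their callbacks.

```python
from collections import deque

class ToyEngine:
    """Toy model of an engine that clears outstanding requests on close."""
    def __init__(self):
        self.outstanding = deque()  # scheduled, not yet handled
        self.handled = []
        self.closing = False

    def schedule(self, request):
        if not self.closing:
            self.outstanding.append(request)

    def close_spider(self, reason='cancelled'):
        # Mirrors "close (cancel) spider and clear all its outstanding requests"
        self.closing = True
        self.outstanding.clear()

    def run(self):
        while self.outstanding:
            self.handled.append(self.outstanding.popleft())

engine = ToyEngine()
for i in range(1, 6):
    engine.schedule('http://someurl.com/item/' + str(i))

# Two responses get handled, then CloseSpider is raised:
engine.handled.append(engine.outstanding.popleft())
engine.handled.append(engine.outstanding.popleft())
engine.close_spider('done')
engine.run()

print(len(engine.handled))  # -> 2; the remaining 3 requests were discarded
```

In short: requests already in flight may still complete, but anything still waiting in the scheduler when the spider closes is dropped rather than handled.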