Tags: python, python-2.7, web-scraping, scrapy, scrapy-middleware

How to retry IndexError in Scrapy


Sometimes only half of a page is scraped successfully, so my parsing logic fails partway through with an IndexError. How can I retry the request when I get an IndexError?

Ideally this would be a middleware, so it could handle multiple spiders at once.


Solution

  • In the end, I used a decorator that calls the _retry() method of RetryMiddleware from inside the wrapper function. It works well. It is not ideal; a dedicated middleware would handle this more cleanly, but it is better than nothing.

    import logging

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    def handle_exceptions(function):
        def parse_wrapper(spider, response):
            try:
                # Iterate inside the try block: the callback is a
                # generator, so exceptions surface during iteration,
                # not when the function is called.
                for result in function(spider, response):
                    yield result
            except IndexError as e:
                logging.error("Debug HTML parsing error: %s",
                              unicode(response.body, 'utf-8'))
                # Reuse Scrapy's RetryMiddleware to build a retried copy
                # of the failed request; _retry() returns None once the
                # retry limit is reached.
                RM = RetryMiddleware(spider.settings)
                retried = RM._retry(response.request, e, spider)
                if retried is not None:
                    yield retried
        return parse_wrapper
    

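    One detail worth emphasizing: because Scrapy callbacks are generators, the try block must wrap the iteration of the wrapped function, not just the call, since calling a generator function never executes its body. A minimal stdlib-only sketch of the same pattern (the `parse` function and the "retry" marker here are illustrative stand-ins, not Scrapy's API):

```python
import logging

def handle_exceptions(function):
    # Wrap a generator function so errors raised while it runs are
    # caught; merely calling function() would raise nothing.
    def wrapper(*args, **kwargs):
        try:
            for result in function(*args, **kwargs):
                yield result
        except IndexError as exc:
            logging.error("parse failed: %s", exc)
            # Stand-in for re-queuing the failed request.
            yield ("retry", str(exc))
    return wrapper

@handle_exceptions
def parse(items):
    yield items[0]
    yield items[5]  # raises IndexError on short input

print(list(parse([1, 2])))  # first item, then the retry marker
```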
    Then I use the decorator like this:

    @handle_exceptions
    def parse(self, response):