Tags: python, python-2.7, web-scraping, scrapy, scrapy-middleware

How to retry IndexError in Scrapy


Sometimes only half of a page is scraped successfully, so my parsing logic fails partway through with an IndexError. How can I retry the request when I get an IndexError?

Ideally this would be a middleware, so it could handle multiple spiders at once.


Solution

  • In the end, I used a decorator that calls the _retry() method of RetryMiddleware from inside the wrapper function. It works well. It is not ideal; a dedicated middleware would handle this more cleanly, but it is better than nothing.

    import logging

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    def handle_exceptions(function):
        def parse_wrapper(spider, response):
            try:
                # Iterate inside the try block: the callback is a
                # generator, so exceptions surface during iteration,
                # not when the function is called.
                for result in function(spider, response):
                    yield result
            except IndexError as e:
                logging.error("Debug HTML parsing error: %s",
                              unicode(response.body, 'utf-8'))
                # Reuse Scrapy's RetryMiddleware to build a retried copy
                # of the failed request; _retry() returns None once the
                # retry limit is reached.
                RM = RetryMiddleware(spider.settings)
                retried = RM._retry(response.request, e, spider)
                if retried is not None:
                    yield retried
        return parse_wrapper
    

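    One detail worth emphasizing: because Scrapy callbacks are generators, the try block must wrap the iteration of the wrapped function, not just the call, since calling a generator function never executes its body. A minimal stdlib-only sketch of the same pattern (the `parse` function and the "retry" marker here are illustrative stand-ins, not Scrapy's API):

```python
import logging

def handle_exceptions(function):
    # Wrap a generator function so errors raised while it runs are
    # caught; merely calling function() would raise nothing.
    def wrapper(*args, **kwargs):
        try:
            for result in function(*args, **kwargs):
                yield result
        except IndexError as exc:
            logging.error("parse failed: %s", exc)
            # Stand-in for re-queuing the failed request.
            yield ("retry", str(exc))
    return wrapper

@handle_exceptions
def parse(items):
    yield items[0]
    yield items[5]  # raises IndexError on short input

print(list(parse([1, 2])))  # first item, then the retry marker
```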
    Then I use the decorator like this:

    @handle_exceptions
    def parse(self, response):