web-scraping, scrapy, http-status-codes, scrapy-settings

How to mark a scrape that failed because of a 503 as an error in Scrapy?


So I get status 503 when I crawl. The request is retried, but then it gets ignored. I want it to be marked as an error, not ignored. How do I do that?

I'd prefer to set it in settings.py so it applies to all of my spiders. handle_httpstatus_list seems to only affect one spider.
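For context, by handle_httpstatus_list I mean the per-spider class attribute, roughly like this (spider name and URL are placeholders):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'  # placeholder
        start_urls = ['https://example.com/']  # placeholder
        # Per-spider only: 503 responses reach the callback
        # instead of being dropped by HttpErrorMiddleware
        handle_httpstatus_list = [503]

        def parse(self, response):
            if response.status == 503:
                self.logger.error("Got 503 for %s", response.url)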


Solution

  • In the end, I overrode the retry middleware with just a small change: whenever the scraper gives up retrying a request, no matter the status code, the failure is logged as an error.

    It seems Scrapy somehow doesn't treat giving up retrying as an error. That seems weird to me.

    This is the middleware if anyone wants to use it. Don't forget to activate it in settings.py (see the snippet after the code).

    import logging

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    logger = logging.getLogger(__name__)
    
    class Retry500Middleware(RetryMiddleware):
    
        def _retry(self, request, reason, spider):
            retries = request.meta.get('retry_times', 0) + 1
    
            if retries <= self.max_retry_times:
                logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                             {'request': request, 'retries': retries, 'reason': reason},
                             extra={'spider': spider})
                retryreq = request.copy()
                retryreq.meta['retry_times'] = retries
                retryreq.dont_filter = True
                retryreq.priority = request.priority + self.priority_adjust
                return retryreq
            else:
                # The one change from the stock RetryMiddleware:
                # this used to be `logger.debug` instead of `logger.error`
                logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                             {'request': request, 'retries': retries, 'reason': reason},
                             extra={'spider': spider})
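
    To activate it, disable the stock RetryMiddleware and register this one in settings.py. The module path below is a placeholder for wherever the class lives in your project, and 550 matches the priority the built-in retry middleware normally has:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        # Disable the built-in retry middleware so both don't run
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        # 'myproject.middlewares' is a placeholder module path
        'myproject.middlewares.Retry500Middleware': 550,
    }

    # Optional, example values: which status codes are retried and how often
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
    RETRY_TIMES = 2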