I get a 503 status when I crawl. The request is retried, but once the retries run out it just gets ignored. I want it to be marked as an error, not ignored. How can I do that?
I would prefer to set this in settings.py so that it applies to all of my spiders; handle_httpstatus_list seems to affect only one spider.
In the end, I overrode the retry middleware to make one small change: whenever the scraper gives up retrying a request, no matter what the status code is, it is now logged as an error.
It seems Scrapy doesn't treat giving up on retries as an error, which strikes me as odd.
Here is the middleware if anyone wants to use it. Don't forget to activate it in settings.py.
    import logging

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    # Use a module-level logger instead of relying on a wildcard import.
    logger = logging.getLogger(__name__)


    class Retry500Middleware(RetryMiddleware):

        def _retry(self, request, reason, spider):
            retries = request.meta.get('retry_times', 0) + 1
            if retries <= self.max_retry_times:
                logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                             {'request': request, 'retries': retries, 'reason': reason},
                             extra={'spider': spider})
                # Re-schedule a copy of the request with the retry count updated.
                retryreq = request.copy()
                retryreq.meta['retry_times'] = retries
                retryreq.dont_filter = True
                retryreq.priority = request.priority + self.priority_adjust
                return retryreq
            else:
                # This is the point where I changed it. It used to be `logger.debug` instead of `logger.error`
                logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                             {'request': request, 'retries': retries, 'reason': reason},
                             extra={'spider': spider})
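
To activate it for every spider in the project, something like the following in settings.py should work. This is only a sketch: the module path `myproject.middlewares` is a placeholder I made up, so replace it with wherever you saved the class. It disables the built-in retry middleware and puts the subclass in the same slot (550 is the built-in's default order).

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Turn off the stock retry middleware so requests are not retried twice.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # Hypothetical import path; adjust to your own project layout.
    "myproject.middlewares.Retry500Middleware": 550,
}
```

The usual retry settings such as RETRY_TIMES and RETRY_HTTP_CODES should still apply, since the subclass only changes the log level used when retries are exhausted.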