
scrapy stops crawling with 500 Internal Server Error


I am crawling a web with scrapy and I receive the error:

Gave up retrying <GET https://www.something.net> (failed 3 times): 500 Internal Server Error

even though, in the parse method, I have added this key to the meta of the scrapy.Request whose callback is the parse function:

"handle_httpstatus_all": True,

Then in the parse function I do:

item = response.meta['item']
if response.status == 200:
    ...  # keeps building the item
yield item

So in theory this should not happen. What can I do to avoid it?


Solution

  • Your theory is missing some vital information.

    Scrapy has two different sets of middleware that each request must pass through. The one you are referring to is the HttpErrorMiddleware, which belongs to the spider-middleware group. If this middleware is enabled and you set the request meta key handle_httpstatus_all to True, then it does in fact allow all failed responses through to your callback to be parsed.
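
    A minimal sketch of that setup (the spider name is a placeholder; the URL is the one from the question) looks roughly like this:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider, only to illustrate the meta key.
        name = "example"

        def start_requests(self):
            yield scrapy.Request(
                "https://www.something.net",
                callback=self.parse,
                # Let non-2xx responses reach parse() instead of being
                # filtered out by HttpErrorMiddleware.
                meta={"handle_httpstatus_all": True},
            )

        def parse(self, response):
            if response.status != 200:
                self.logger.info("Got %s for %s", response.status, response.url)
            # ... build and yield items here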

    However, there is another group of middleware, the downloader middleware, which requests and responses pass through before they ever reach the spider middleware. Among these is the RetryMiddleware, which identifies responses with certain error codes that are considered potentially temporary and automatically resends those requests up to a set number of times before the response is officially considered failed.

    So your theory is still accurate in the sense that all failed responses are allowed through, but responses with certain error codes first go through a few retry attempts before they get processed.
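
    If you want to see those retries from the spider's side, RetryMiddleware records the attempt count in the request meta under retry_times, so a rough sketch inside the callback could log it:

    # Inside the spider's parse() callback:
    def parse(self, response):
        # retry_times is set by RetryMiddleware; it is absent on the
        # first attempt, hence the default of 0.
        retries = response.meta.get("retry_times", 0)
        self.logger.info(
            "%s arrived with status %s after %s retries",
            response.url, response.status, retries,
        )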

    You can customize the middleware's behavior by setting the max_retry_times meta key to a custom number of retries, by setting the dont_retry meta key to True, or by disabling the retry middleware altogether with RETRY_ENABLED = False in your settings.
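
    For example, a sketch of the per-request options (placeholder URL again) might look like this, with the project-wide alternative noted in a comment:

    # Inside any spider callback:
    yield scrapy.Request(
        "https://www.something.net",
        callback=self.parse,
        meta={
            "handle_httpstatus_all": True,
            "dont_retry": True,       # hand the first 500 straight to parse()
            # "max_retry_times": 1,   # ...or allow a single retry instead
        },
    )

    # Or, project-wide in settings.py:
    # RETRY_ENABLED = False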

    You can also customize which error codes are considered eligible for retry with the RETRY_HTTP_CODES setting.
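
    For instance, the default list in recent Scrapy versions includes 500, so a settings.py entry that drops it while keeping the other defaults would look roughly like this:

    # settings.py: stop retrying 500s but keep retrying the other defaults.
    RETRY_HTTP_CODES = [502, 503, 504, 522, 524, 408, 429]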