Tags: python, scrapy, scrapy-middleware

Scrapy appears to be deduplicating the first request when it is processed with DownloaderMiddleware


I've got a spider which inherits from SitemapSpider. As expected, the first request on startup is to my website's sitemap.xml. However, for it to work correctly I need to add a Host header to all requests, including the initial ones which fetch the sitemap. I do so with a DownloaderMiddleware, like this:

def process_request(self, request: scrapy.http.Request, spider):
    if "Host" in request.headers:
        return None

    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request

However, it looks like Scrapy's duplicate request filter is stopping this request from going through. In my logs I see something like this:

2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml> 

Since spider.logger.info in process_request is triggered only once, I presume that this is the first request and that it gets deduplicated after processing. I thought that deduplication might be triggered before DownloaderMiddleware (that would explain why the request is deduplicated without a second "Got ..." in the logs), but I don't think that's true, for two reasons:

  • I looked through the code of SitemapSpider, and it appears to fetch the sitemap.xml only once
  • If it had in fact fetched it before, I'd expect it to do something with the response; instead it just stops the spider, since no pages were enqueued for processing

Why does this happen? Did I make some mistake in process_request?


Solution

  • It won't do anything with the first response, nor fetch a second one, because you are returning a request from your custom DownloaderMiddleware's process_request method, and that rescheduled request is the one being filtered out. From the docs:

    If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
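
    Why is the rescheduled request dropped? The default dupefilter (RFPDupeFilter) fingerprints a request by its method, canonicalized URL, and body; headers are not part of the fingerprint, so your copy with the added Host header looks identical to the original sitemap.xml request the scheduler has already seen. A minimal sketch illustrating this, using scrapy.utils.request.fingerprint (available in Scrapy 2.7+; older versions expose request_fingerprint with the same default behaviour):

    from scrapy.http import Request
    from scrapy.utils.request import fingerprint

    r1 = Request("https://mywebsite.com/sitemap.xml")
    r2 = Request("https://mywebsite.com/sitemap.xml", headers={"Host": "mywebsite.com"})

    # Headers are not part of the default fingerprint, so both requests
    # collide in the dupefilter:
    print(fingerprint(r1) == fingerprint(r2))  # True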

    It might work if you explicitly tell Scrapy not to filter your second request:

    def process_request(self, request: scrapy.http.Request, spider):
        if "Host" in request.headers:
            # Header already set on a previous pass; let the request proceed.
            return None

        host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
        # Copy the request with dont_filter=True so the rescheduled copy
        # is exempt from the dupefilter.
        new_req = request.replace(dont_filter=True)
        new_req.headers["Host"] = host
        spider.logger.info(f"Got {new_req}")
        return new_req
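
    As a side note, process_request is also allowed to modify the request in place and return None, in which case Scrapy simply continues processing that same request through the remaining middlewares. That avoids the rescheduling, and therefore the dupefilter, entirely. A sketch of that variant:

    def process_request(self, request: scrapy.http.Request, spider):
        host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
        # Mutating the request and returning None keeps it moving through
        # the middleware chain; it is never rescheduled, so the dupefilter
        # never sees it a second time.
        request.headers["Host"] = host
        return None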