I've got a spider which inherits from SitemapSpider. As expected, the first request on startup is to the sitemap.xml of my website. However, for it to work correctly I need to add a header to all requests, including the initial ones that fetch the sitemap. I do so with a DownloaderMiddleware, like this:
def process_request(self, request: scrapy.http.Request, spider):
    # Header already set: let the request through unchanged
    if "Host" in request.headers:
        return None
    # str.removeprefix() requires Python 3.9+
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request
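For reference, a middleware like this is enabled through Scrapy's standard DOWNLOADER_MIDDLEWARES setting. A minimal sketch, assuming the method above lives in a class called HostHeaderMiddleware in myproject/middlewares.py (both names made up for illustration):

# settings.py -- class path and priority value are assumptions
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HostHeaderMiddleware": 543,
}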
However, it looks like Scrapy's duplicate request filter is stopping this request from going through. In my logs I see something like this:
2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml>
Since spider.logger.info in process_request is triggered only once, I presume that it sees the first request and that, after processing, the request gets deduplicated. I thought that, maybe, deduplication is triggered before DownloaderMiddleware (that would explain why the request is deduplicated without a second "Got ..." line in the logs), but I don't think that's the case.
Why does this happen? Did I make some mistake in process_request?
Scrapy won't do anything with the first response, nor fetch a second one: you are returning a new request from your custom DownloaderMiddleware's process_request method, and that returned request is what gets filtered out. From the docs:
If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
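The reason the rescheduled request counts as a duplicate even though its headers differ is that, by default, Scrapy's request fingerprint ignores headers entirely. A quick sketch to illustrate, using scrapy.utils.request.request_fingerprint (available in the Scrapy versions current at the time of this question; later releases deprecate it in favour of a fingerprinter component):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

plain = Request("https://mywebsite.com/sitemap.xml")
with_host = Request("https://mywebsite.com/sitemap.xml",
                    headers={"Host": "mywebsite.com"})

# Headers are not part of the default fingerprint, so the dupefilter
# treats both of these as the same request.
assert request_fingerprint(plain) == request_fingerprint(with_host)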
It should work if you explicitly tell Scrapy not to filter your second request:
def process_request(self, request: scrapy.http.Request, spider):
    # Second pass: header already present, let the request proceed
    if "Host" in request.headers:
        return None
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    # replace() copies the request; dont_filter=True bypasses the dupefilter
    new_req = request.replace(dont_filter=True)
    new_req.headers["Host"] = host
    spider.logger.info(f"Got {new_req}")
    return new_req
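Note the role of the if "Host" in request.headers guard here: when the rescheduled request passes through the middleware a second time, the header is already set, so process_request returns None and the request continues on to the downloader. Without that guard, unconditionally returning a dont_filter=True request would reschedule it forever.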