I'm using Scrapy to crawl pages from different websites. With every scrapy.Request() I set some metadata, which is later used to yield an item. My code may also yield multiple scrapy.Request() objects for the same URL, each with different meta.
yield scrapy.Request(url='http://www.example.com', meta={'some_field': 'some_value'} ..)
Now I can set dont_filter=True and Scrapy won't block the duplicate request.
yield scrapy.Request(url='http://www.example.com', meta={'some_other_field': 'some_other_value'}, dont_filter=True, ..)
However, for duplicate requests I'm only interested in the metadata set on scrapy.Request(), so I want to yield an Item from RFPDupeFilter or from a CustomDupFilter, so that it will be written to JSON by the item pipeline.
class CustomDupFilter(BaseDupeFilter):
    def request_seen(self, request: Request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            yield request.meta['some_other_value']  # yield metadata as Item
            self.fingerprints.add(fp)
            return True
        else:
            return False
Any help is much appreciated.
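For context on why the request_seen() attempt above cannot work as written: any Python function that contains a yield becomes a generator function, so calling it returns a generator object instead of True or False. A generator object is always truthy, so a caller that only checks the return value in a boolean context would treat every request as already seen. The sketch below shows this with plain Python and no Scrapy; request_seen_with_yield and the fingerprint string are hypothetical stand-ins, not Scrapy API.

```python
import types


def request_seen_with_yield(fp, fingerprints):
    # Hypothetical stand-in for the request_seen() attempt above.
    # The yield statement turns this into a generator function.
    if fp in fingerprints:
        yield {"some_field": "some_value"}
        return True
    return False


seen = set()  # empty: no fingerprint has been recorded yet
result = request_seen_with_yield("never-seen-fp", seen)

# The call did not run the body; it only created a generator object.
print(isinstance(result, types.GeneratorType))
# The generator is truthy even though the fingerprint was never seen.
print(bool(result))
```

Both print statements output True, so a boolean check on the result would report a brand new request as a duplicate.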
I don't think you can yield items from a dupefilter: request_seen() is expected to return a bool, not produce items. One way around this is to disable the filter and handle duplicate requests in a custom spider middleware. Maybe something like this:
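import scrapy
from scrapy.utils.request import fingerprint  # available in Scrapy 2.7+

class DupeFilterMiddleware:
    seen_requests = set()  # fingerprints of requests already let through

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, scrapy.Request) and fingerprint(output) in self.seen_requests:
                # duplicate request: emit its meta as an item instead of scheduling it
                # (yielding a plain dict is an assumption; use your own Item class if you have one)
                yield dict(output.meta)
            elif isinstance(output, scrapy.Request):
                self.seen_requests.add(fingerprint(output))
                yield output
            else:
                yield output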
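To wire this up, the built-in filter has to be switched off and the middleware enabled in the project settings. A minimal sketch of settings.py follows. The module path myproject.middlewares and the priority 543 are assumptions for illustration, so adjust them to your project. Setting DUPEFILTER_CLASS to BaseDupeFilter is the documented way to disable duplicate filtering, since that class never reports a request as seen.

```python
# settings.py (sketch)

# Disable Scrapy's built-in duplicate filter; BaseDupeFilter filters nothing.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"

# Enable the custom spider middleware.
# "myproject.middlewares" and the priority 543 are hypothetical placeholders.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DupeFilterMiddleware": 543,
}
```

Note that process_spider_output only sees what spider callbacks return, so the initial start requests do not pass through it and their fingerprints will not be recorded by this middleware.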