python-3.x scrapy scrapy-pipeline

How to yield an item from RFPDupeFilter or a custom filter


I'm using Scrapy to crawl pages from different websites. On every scrapy.Request() I set some metadata, which is later used to yield an item. My code may also yield multiple scrapy.Request() objects for the same URL, each with different meta.

yield scrapy.Request(url='http://www.example.com', meta={'some_field': 'some_value'} ..)

I can set dont_filter=True so that Scrapy doesn't drop the duplicate request.

yield scrapy.Request(url='http://www.example.com', meta={'some_other_field': 'some_other_value'}, dont_filter=True, ..)

However, for duplicate requests I'm only interested in the metadata set on scrapy.Request(), so I want to yield an item from RFPDupeFilter or CustomDupFilter and have it written to JSON by the item pipeline.

    class CustomDupFilter(BaseDupeFilter):

        def request_seen(self, request: Request) -> bool:
            fp = self.request_fingerprint(request)
            if fp in self.fingerprints:
                # duplicate: this is where I'd like to yield the metadata as an Item
                yield request.meta['some_other_field']
                return True
            else:
                self.fingerprints.add(fp)  # remember first-time requests
                return False

Any help is much appreciated.


Solution

  • I don't think you can yield items from a dupefilter, since request_seen() is expected to return a bool. One way around this is to disable the filter and handle duplicate requests in a custom spider middleware instead. Maybe something like this:

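    import scrapy
    from scrapy.utils.request import fingerprint  # available in Scrapy 2.7+

    class DupeFilterMiddleware:
        seen_requests = set()  # fingerprints of requests already passed through

        def process_spider_output(self, response, result, spider):
            for output in result:
                if isinstance(output, scrapy.Request) and fingerprint(output) in self.seen_requests:
                    # duplicate request: yield its metadata as an item instead
                    yield dict(output.meta)
                elif isinstance(output, scrapy.Request):
                    # first time this request is seen: remember it and let it through
                    self.seen_requests.add(fingerprint(output))
                    yield output
                else:
                    # items and anything else pass through unchanged
                    yield output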
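
A spider middleware like the one above only runs once it is enabled in the project settings. As a rough sketch, assuming the class lives in a module called myproject.middlewares (a hypothetical path, adjust it to your project) and picking 543 as an arbitrary order value, the settings.py entry might look like this:

```python
# settings.py (sketch)
# "myproject.middlewares" is a hypothetical module path; point it at the
# module that actually contains DupeFilterMiddleware in your project.
# 543 is an arbitrary order value chosen for illustration.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DupeFilterMiddleware": 543,
}
```

With this in place, duplicate requests yielded by spider callbacks should be turned into items before they reach the scheduler, so the item pipeline can write their metadata out like any other item.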