scrapy portia

How do I get the latest articles from a website using Portia?


I am using Portia to crawl the articles of a website. How can I get only the latest articles each day when I run the Portia spider?

One idea is to take each article's date and compare it with the current datetime. Is there a better way?


Solution

  • It depends on how the website is structured, but if every article has its own URL, you could skip URLs already visited in previous crawls by using the DeltaFetch spider middleware.

    To enable it, install scrapylib and add this to your settings.py:

    SPIDER_MIDDLEWARES = {
        'scrapylib.deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True
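
If URL-based filtering does not fit the site, the date comparison idea from the question could be sketched roughly as below. This is only an illustration under assumptions: the helper names `is_recent` and `filter_recent`, the `date` item field holding an ISO 8601 string, and the one-day window are all hypothetical and not part of Portia or Scrapy.

```python
from datetime import datetime, timedelta, timezone


def is_recent(published, now=None, max_age=timedelta(days=1)):
    """Return True if `published` lies within `max_age` before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Reject dates in the future as well as dates that are too old.
    return timedelta(0) <= now - published <= max_age


def filter_recent(items, now=None, max_age=timedelta(days=1)):
    """Keep items whose (hypothetical) 'date' field is recent enough."""
    recent = []
    for item in items:
        # Assumption: the extracted date is an ISO 8601 string.
        published = datetime.fromisoformat(item["date"])
        if published.tzinfo is None:
            # Assumption: dates without a timezone are treated as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if is_recent(published, now, max_age):
            recent.append(item)
    return recent
```

Note that this approach still downloads every article page before discarding old ones, whereas the middleware above avoids re-requesting pages seen in earlier crawls.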