I am using Portia to crawl the articles of a website. How can I get only the latest articles each day when I run the Portia spider?
My idea is to take the date from each article and compare it with the current date and time. Is there a better way?
It depends on how the website is structured, but if every article has its own URL, you can skip URLs already visited in previous crawls by using the deltafetch spider middleware.
To enable it, install scrapylib and add this to your settings.py:
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
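If the site reuses URLs, or you would still like to filter by publication date as the question suggests, the comparison could look roughly like the sketch below. It is only an illustration: the "date" field name, the date format, and the helper names is_recent and filter_recent are assumptions, not part of Portia or scrapylib, and would need adjusting to whatever your spider actually extracts.

```python
from datetime import datetime, timedelta


def is_recent(published, now, max_age=timedelta(days=1)):
    """Return True if `published` is at most `max_age` older than `now`.

    Dates in the future (relative to `now`) are treated as not recent.
    """
    return timedelta(0) <= now - published <= max_age


def filter_recent(items, now, date_format="%Y-%m-%d %H:%M"):
    """Keep only the items whose scraped date is recent.

    Assumes each item is a dict with a hypothetical "date" field holding
    the article date as text in `date_format`.
    """
    recent = []
    for item in items:
        published = datetime.strptime(item["date"], date_format)
        if is_recent(published, now):
            recent.append(item)
    return recent
```

In a real project this check would most likely live in an item pipeline that drops old items, with `now` taken from datetime.now() at crawl time.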