Tags: scrapy, scrapy-pipeline

Scraping Blogs - avoid already scraped items by checking urls from json/csv in advance


I'd like to scrape news pages / blogs (anything that contains new information on a daily basis).

My crawler works fine and does everything I kindly asked it to do.

But I cannot find a proper solution for making it ignore already scraped URLs (or items, to keep it more general) and only append new URLs/items to an already existing JSON/CSV file.

I've seen many solutions here for checking whether an item exists in a CSV file, but none of these "solutions" really worked.

Scrapy DeltaFetch apparently cannot be installed on my system. I get errors, and all the hints, e.g. $ sudo pip install bsddb3, upgrade this and update that, etc., do not do the trick. (I've tried for 3 hours now and am fed up with hunting for a fix for a package that hasn't been updated since 2017.)

I hope you have a handy and practical solution.

Thank you very much in advance!

Best regards!


Solution

  • An option could be a custom downloader middleware with the following:

    • A process_response method that stores the URL you crawled in a database
    • A process_request method that checks whether the URL is already present in the database. If it is, you raise an IgnoreRequest so the request does not go through anymore (see the sketch after this list).
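
Here is a minimal sketch of such a middleware, assuming a local SQLite file as the "database"; the class name, file name, and database path are placeholders, so adapt them to your project and preferred storage.

```python
import sqlite3

from scrapy.exceptions import IgnoreRequest


class SeenUrlsMiddleware:
    """Downloader middleware that skips URLs crawled on previous runs."""

    def __init__(self, db_path="seen_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)"
        )

    @classmethod
    def from_crawler(cls, crawler):
        # The path could also be read from settings, e.g. crawler.settings.get(...)
        return cls()

    def process_request(self, request, spider):
        # Drop the request if its URL was stored during an earlier run.
        cur = self.conn.execute(
            "SELECT 1 FROM seen WHERE url = ?", (request.url,)
        )
        if cur.fetchone():
            raise IgnoreRequest(f"Already scraped: {request.url}")
        return None  # let the request continue through the middleware chain

    def process_response(self, request, response, spider):
        # Remember the URL only after a successful response.
        if response.status == 200:
            self.conn.execute(
                "INSERT OR IGNORE INTO seen (url) VALUES (?)", (request.url,)
            )
            self.conn.commit()
        return response
```

You would then enable it in settings.py (module path and priority below are placeholders for your project layout):

```python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeenUrlsMiddleware": 543,
}
```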