Best Method to Push Data from Scrapy to a .NET Application


I am looking for the best way to push scraped data from Scrapy crawlers to a .NET application.

Setup:

  1. A Debian server runs a Scrapy server
  2. A Windows server runs a .NET Core application server

I am thinking about adding a RESTful API to my .NET Core service and pushing item data to it from Scrapy on every crawler "finished" event.

Basically, I want a kind of "push notification" from the Scrapy server to my .NET app whenever a new data item is scraped.

What is the best place in Scrapy to put that call to an external API?


Solution

  • You have several options. Pushing the data is indeed the simplest approach, but make sure the requests to your API are authenticated. You can hook the item_scraped signal (or use an item pipeline) to send a request for every scraped item. Keep in mind that with hundreds of scraped items, one request per item can put a lot of load on your API, which you should avoid. Instead, you can wait until the spider has finished (the spider_closed signal, or close_spider in a pipeline) and send everything to your API in a single request. Some alternative solutions:

    • Put scraped items in your database and poll the database for new items in the other application
    • Use a message queue such as RabbitMQ, AWS SQS, or Kafka
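
As a rough illustration of the push approach, here is a minimal sketch of an item pipeline that buffers scraped items and POSTs them to the API in batches, flushing whatever is left when the spider closes. The class name, the PUSH_API_URL, PUSH_API_KEY and PUSH_BATCH_SIZE settings, the bearer-token header, and the JSON-array payload are all assumptions made for this example; adapt them to whatever your .NET endpoint actually expects.

```python
import json
import urllib.request


class BatchPushPipeline:
    """Item pipeline sketch: buffer items, POST them to an external API in batches."""

    def __init__(self, endpoint, api_key, batch_size=100, send=None):
        self.endpoint = endpoint      # hypothetical URL of the .NET API
        self.api_key = api_key        # hypothetical bearer token
        self.batch_size = batch_size
        self.buffer = []
        # The sender is injectable so the batching logic can be tested offline.
        self._send = send or self._post_json

    @classmethod
    def from_crawler(cls, crawler):
        # PUSH_* are made-up setting names for this sketch, not Scrapy built-ins.
        settings = crawler.settings
        return cls(
            endpoint=settings.get("PUSH_API_URL"),
            api_key=settings.get("PUSH_API_KEY"),
            batch_size=settings.getint("PUSH_BATCH_SIZE", 100),
        )

    def process_item(self, item, spider):
        # dict() handles plain dicts and scrapy.Item; other item types need converting.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # Called once when the spider finishes: send whatever is still buffered.
        self._flush()

    def _flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self._send(batch)

    def _post_json(self, batch):
        # Blocking POST using only the standard library.
        request = urllib.request.Request(
            self.endpoint,
            data=json.dumps(batch).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            },
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            response.read()
```

To try it, you would register the class under ITEM_PIPELINES in settings.py (the module path depends on your project layout) and define the three PUSH_* settings. Setting the batch size to 1 gives you per-item "push notifications", while a very large value effectively sends a single request when the spider closes. Note that the POST here is blocking and has no retry logic, so a slow or unavailable API will stall or fail the crawl; for high volumes a queue is probably the safer choice.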