I have built a Scrapy spider that takes a user_id as a command-line argument and gets the URLs to crawl from a database. Now I want to make my application scalable.
I have been looking at some of the solutions suggested on the internet, but none of them exactly match my requirements: some suggest passing a bunch of URLs to Scrapy and scraping them, while others suggest starting from a root URL and leaving everything to Scrapy. My use case is different, so I am looking for the right approach here.
Instead of distributing URLs, distributing the client IDs across the spiders would also be fine. A simplified sketch of my current setup is below.
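For context, this is roughly what the spider looks like (a minimal sketch; the SQLite database and the `urls(user_id, url)` table are placeholders for my real storage):

```python
import sqlite3
import scrapy


class UserUrlSpider(scrapy.Spider):
    # Run with: scrapy crawl user_urls -a user_id=42
    name = "user_urls"

    def __init__(self, user_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.user_id = user_id

    def start_requests(self):
        # Fetch only the URLs belonging to this user_id, so each spider
        # process works on one user's slice of the data.
        conn = sqlite3.connect("urls.db")
        rows = conn.execute(
            "SELECT url FROM urls WHERE user_id = ?", (self.user_id,)
        ).fetchall()
        conn.close()
        for (url,) in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            "user_id": self.user_id,
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```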
You could use Scrapinghub Cloud for that. Scrapy spiders work out of the box on it, and you could use its Collections API to store your user_ids for the spiders to consume.
There is a free tier if you wish to test it.
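A rough sketch of that flow with the python-scrapinghub client (the collection name, project id, and the job-scheduling comment are assumptions for illustration):

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(12345)  # your Scrapy Cloud project id
store = project.collections.get_store("user_ids")  # hypothetical collection name

# Producer side: enqueue the user_ids that need to be scraped.
store.set({"_key": "42", "value": {"status": "pending"}})

# Consumer side: read the pending user_ids and start one job per user,
# e.g. project.jobs.run("user_urls", job_args={"user_id": user_id}).
for item in store.iter():
    user_id = item["_key"]
    print("would schedule a crawl for user_id", user_id)
```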
But if you would rather run a self-hosted solution, you could try Frontera:
Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.
Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.
Main features
(...)
Built-in Apache Kafka and ZeroMQ message buses.
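As a rough idea of how Frontera plugs into Scrapy, the settings below follow the pattern from the Frontera docs for swapping in its scheduler and pointing it at the Kafka message bus; exact module paths and setting names can differ between versions, so treat this as a sketch and check the docs for your release:

```python
# settings.py of your Scrapy project (sketch)
SCHEDULER = "frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler"

SPIDER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware": 1000,
}
DOWNLOADER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware": 1000,
}

# Points Scrapy at a separate Frontera settings module.
FRONTERA_SETTINGS = "myproject.frontera_settings"  # hypothetical module name

# myproject/frontera_settings.py (sketch, shown here as comments):
# MESSAGE_BUS = "frontera.contrib.messagebus.kafkabus.MessageBus"
# KAFKA_LOCATION = "localhost:9092"
```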