I am using scrapyd to run multiple spiders as jobs across the same domain. I assumed Scrapy had a hashtable of visited URLs that it shared and coordinated between spiders while crawling. When I create instances of the same spider with

curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername

the instances instead crawl the same URLs and scrape duplicate data. Has someone dealt with a similar problem before?
My advice would be to try dividing the site into multiple start_urls lists. Then you can pass a different start_urls value to each spider.
If you want to get particularly fancy (or if the pages you want to crawl change on a regular basis), you could create a spider that crawls the sitemap, divides the links up into n chunks, then starts n other spiders to actually crawl the site...
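Here is a rough sketch of that coordinator idea, assuming the site exposes a sitemap.xml at a known location and that the worker spider accepts a start_urls argument as shown above. The spider names, the chunk count, and the scrapyd URL are all illustrative:

    import json
    import urllib.parse
    import urllib.request
    from xml.etree import ElementTree

    import scrapy

    SCRAPYD_SCHEDULE = "http://localhost:6800/schedule.json"

    class CoordinatorSpider(scrapy.Spider):
        name = "coordinator"
        start_urls = ["http://example.com/sitemap.xml"]  # assumed sitemap location
        n_workers = 4  # number of worker jobs to schedule

        def parse(self, response):
            # Pull every <loc> entry out of the sitemap.
            ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
            root = ElementTree.fromstring(response.body)
            urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

            # Split the URL list into n roughly equal chunks.
            chunks = [urls[i::self.n_workers] for i in range(self.n_workers)]

            # Schedule one worker job per chunk through scrapyd's schedule.json.
            for chunk in chunks:
                if not chunk:
                    continue
                data = urllib.parse.urlencode({
                    "project": "projectname",
                    "spider": "spidername",
                    "start_urls": ",".join(chunk),
                }).encode()
                with urllib.request.urlopen(SCRAPYD_SCHEDULE, data=data) as resp:
                    self.logger.info("scheduled job: %s", json.loads(resp.read()))

Scrapy also ships a SitemapSpider you could base the coordinator on instead of parsing the XML by hand; the manual parsing above just keeps the sketch self-contained.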