Tags: python, screen-scraping, scrapy, scrapyd

Run multiple scrapy spiders at once using scrapyd


I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!


Solution

  • My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

    YOURPROJECTNAME/commands/allcrawl.py (the commands/ directory also needs an empty __init__.py so Python treats it as a package):

    # Python 2 / older-Scrapy code (urllib2, scrapy.command)
    from scrapy.command import ScrapyCommand
    import urllib
    import urllib2
    from scrapy import log

    class AllCrawlCommand(ScrapyCommand):

        requires_project = True
        default_settings = {'LOG_ENABLED': False}

        def short_desc(self):
            return "Schedule a run for all available spiders"

        def run(self, args, opts):
            url = 'http://localhost:6800/schedule.json'
            # POST one schedule.json request to the local scrapyd instance
            # for every spider registered in the project.
            for s in self.crawler.spiders.list():
                values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
                data = urllib.urlencode(values)
                req = urllib2.Request(url, data)
                response = urllib2.urlopen(req)
                log.msg(response)
    

    Make sure to include the following in your settings.py

    COMMANDS_MODULE = 'YOURPROJECTNAME.commands'
    

    Then from the command line (in your project directory) you can simply type

    scrapy allcrawl
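
    For what it's worth, the imports above are from the Python 2 era. On Python 3 with a recent Scrapy, the command base class lives in scrapy.commands, urllib2 is replaced by urllib.request, and the project's spiders are reachable through the command's crawler_process. A rough equivalent is sketched below; it is untested against every Scrapy version, and it keeps the same localhost scrapyd URL and YOUR_PROJECT_NAME placeholder as the original.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    from scrapy.commands import ScrapyCommand


    class AllCrawlCommand(ScrapyCommand):

        requires_project = True
        default_settings = {'LOG_ENABLED': False}

        def short_desc(self):
            return "Schedule a run for all available spiders"

        def run(self, args, opts):
            url = 'http://localhost:6800/schedule.json'
            # The spider loader on the crawler process knows every spider
            # defined in the project.
            for spider_name in self.crawler_process.spider_loader.list():
                values = {'project': 'YOUR_PROJECT_NAME', 'spider': spider_name}
                data = urlencode(values).encode('utf-8')
                # POSTing to schedule.json queues one scrapyd job per spider.
                with urlopen(url, data) as response:
                    print(response.read().decode('utf-8'))

    Either way, the command only fires the schedule requests and exits; scrapyd still runs each spider as its own job.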