python, scrapy

Run Scrapy Spiders Sequentially


I have several Scrapy spiders inside my spiders directory (let's suppose 50 spiders), and now I want to run them sequentially (not concurrently).

I could run them concurrently with the following code, but because of some policy I've decided to run them sequentially:

from datetime import datetime
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

start = datetime.now()
settings = get_project_settings()
process = CrawlerProcess(settings)
for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("***********Execution time : {0}".format(datetime.now() - start))

I also tried:

import os

for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    os.system("pwd")  # pwd to make sure it's in the correct path, and I see it is
    os.system("pwd && scrapy crawl " + spider_name)

but it seems the spiders don't run via os.system. Another option is a .sh script, but I'm not sure that's a good idea. How can I run the spiders sequentially?


Solution

  • The Scrapy documentation has a section explaining how to run multiple spiders in the same process and also how to do this sequentially.

    For your case, it could look like the following:

    from datetime import datetime
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    
    start = datetime.now()
    settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)
    
    @defer.inlineCallbacks
    def crawl():
        # Run the spiders one after another: each yield waits for the
        # previous crawl to finish before the next one starts.
        for spider_name in runner.spider_loader.list():
            print("Running spider %s" % spider_name)
            yield runner.crawl(spider_name)
        reactor.stop()
    
    crawl()
    reactor.run()  # the script blocks here until all crawls have finished
    print("***********Execution time : {0}".format(datetime.now() - start))
    

    This only works if the spider loader is configured correctly (i.e. the SPIDER_MODULES setting points at your spiders package). Otherwise, you can also simply list all your spiders in the crawl method one by one.
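
    For example, here is a minimal sketch of listing the spiders explicitly instead of relying on the spider loader; the spider classes FooSpider and BarSpider and their import paths are placeholders for your own spiders:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    
    # Placeholder import paths - replace with your actual spider modules.
    from myproject.spiders.foo import FooSpider
    from myproject.spiders.bar import BarSpider
    
    runner = CrawlerRunner(get_project_settings())
    
    @defer.inlineCallbacks
    def crawl():
        # Spiders are listed explicitly, so no spider loader configuration is needed;
        # each yield still waits for the previous crawl to finish.
        yield runner.crawl(FooSpider)
        yield runner.crawl(BarSpider)
        reactor.stop()
    
    crawl()
    reactor.run()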