python · scrapy · twisted

Scrapy: Run spiders sequentially with different settings for each spider


For quite a few days now I have been struggling with Scrapy/Twisted in my Main.py, which is supposed to run different spiders and analyze their outputs. Unfortunately, MySpider2 relies on the FEED from MySpider1 and can therefore only run after MySpider1 has finished. Furthermore, MySpider1 and MySpider2 need different settings. So far I have not found a solution that runs the spiders sequentially, each with its own settings. I have looked at the Scrapy CrawlerRunner and CrawlerProcess docs, and experimented with several related Stack Overflow questions (Run Multiple Spider sequentially, Scrapy: how to run two crawlers one after another?, Scrapy run multiple spiders from a script, and others) without success.
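
To illustrate the dependency, MySpider2 roughly does something like this (a simplified, hypothetical sketch; the real spider is longer, and the 'url' column name is just an assumption here):

import csv
import scrapy

class MySpider2(scrapy.Spider):
    # Seeds its requests from the CSV that MySpider1's feed export
    # (FEED_URI 'abc.csv') writes, which is why it must not start
    # before MySpider1 has finished.
    name = 'myspider2'

    def start_requests(self):
        with open('abc.csv') as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row['url'], callback=self.parse)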

Following the documentation on running spiders sequentially, my (slightly adapted) code looks like this:

from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # more settings here
    }, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # more settings here
    }]

spiders = [MySpider1, MySpider2]

process = CrawlerRunner(spider_settings[0])
process = CrawlerRunner(spider_settings[1])  # Not sure if this is how it's supposed to be used for
# multiple settings, but moving this line to just before "yield process.crawl(spiders[1])" also results in an error.

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()
crawl()
reactor.run()

However, with this code only the first spider is executed, and without any of the settings. I therefore tried CrawlerProcess, with a little more success:

from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # more settings here
    }, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # more settings here
    }]

spiders = [MySpider1, MySpider2]

process = CrawlerProcess(spider_settings[0])
process = CrawlerProcess(spider_settings[1])

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()
crawl()
reactor.run()

This code executes both spiders, but simultaneously rather than sequentially as intended. Furthermore, it overwrites the settings of spiders[0] with those of spiders[1] after about a second, so the first log file is cut off after only two lines and logging for both spiders resumes in 123/log.log.

In a perfect world my snippet would work as follows:

  1. Run spider[0] with spider_settings[0]
  2. Wait until spider[0] is finished.
  3. Run spider[1] with spider_settings[1]

Thanks in advance for the help.


Solution

  • Separate the runners, one CrawlerRunner per settings dict, and it should work:

    process_1 = CrawlerRunner(spider_settings[0])
    process_2 = CrawlerRunner(spider_settings[1])
    
    #...
    
    @defer.inlineCallbacks
    def crawl():
        yield process_1.crawl(spiders[0])
        yield process_2.crawl(spiders[1])
        reactor.stop()
    
    #...
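
  • For completeness, the whole script might then look like this (a sketch assembled from the question's own imports and settings; configure_logging() is added because CrawlerRunner, unlike CrawlerProcess, does not set up logging by itself, and since logging is process-wide, splitting logs per spider via LOG_FILE may still need extra handler management):

    from MySpider1.myspider1.spiders.myspider1 import MySpider1
    from MySpider2.myspider2.spiders.myspider2 import MySpider2
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    spider_settings = [{
        'FEED_URI': 'abc.csv',
        'LOG_FILE': 'abc/log.log'
        }, {
        'FEED_URI': '123.csv',
        'LOG_FILE': '123/log.log'
        }]

    spiders = [MySpider1, MySpider2]

    configure_logging()  # CrawlerRunner does not call this for you

    process_1 = CrawlerRunner(spider_settings[0])
    process_2 = CrawlerRunner(spider_settings[1])

    @defer.inlineCallbacks
    def crawl():
        # Each yield waits for the returned Deferred to fire, so the
        # second crawl starts only after the first has finished.
        yield process_1.crawl(spiders[0])
        yield process_2.crawl(spiders[1])
        reactor.stop()

    crawl()
    reactor.run()  # blocks until reactor.stop() is called above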