Tags: python, web-scraping, scrapy, sequential

How to pass data between sequential spiders


I have two spiders that run in sequential order, following https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. Now I want to pass some information from the first spider to the second (a Selenium webdriver, or its session information).

I'm quite new to Scrapy, but in another post it was proposed to save the data to a database and retrieve it from there. That seems like overkill for passing a single variable; is there no other way? (I know that in this example I could merge everything into one long spider, but later I want to run the first spider once and the second spider multiple times.)

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor


class Spider1(scrapy.Spider):
    name = "spider1"
    # Open a webdriver and get session_id

class Spider2(scrapy.Spider):
    name = "spider2"
    # Get the session_id and run spider2 code
    def __init__(self, session_id):
        ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # TODO How to get the session_id?
    # session_id = yield runner.crawl(Spider1) returns None
    # Or: adding a return statement to Spider1 actually breaks the
    # sequential processing, and the program sleeps before running Spider1

    time.sleep(2)

    yield runner.crawl(Spider2, session_id)  # extra args go to the spider's constructor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

I would like to pass the variable to the constructor of the second spider, but I'm unable to get the data out of the first one. If I make the first crawler return the variable, it apparently breaks the sequential structure. If I try to retrieve it from the yield, the result is None.

Am I completely blind? I can't believe that this should be such a complex task.


Solution

  • You can pass a queue to both spiders and let Spider2 block on queue.get(), so there is no need for time.sleep(2). (runner.crawl() returns a Deferred that simply fires once the crawl has finished, which is why yielding it gives you None rather than anything produced by the spider.)

    # globals.py

    from queue import Queue

    queue = Queue()


    # run.py

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import defer, reactor

    import globals


    class Spider1(scrapy.Spider):
        name = "spider1"

        def __init__(self):
            # put session_id into `globals.queue` somewhere in `Spider1`,
            # so `Spider2` can start
            ...

    class Spider2(scrapy.Spider):
        name = "spider2"

        def __init__(self):
            # blocks until Spider1 has put the session_id on the queue
            session_id = globals.queue.get()

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(Spider1)
        yield runner.crawl(Spider2)
        reactor.stop()

    crawl()
    reactor.run()
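
    For completeness, here is a minimal sketch of how Spider1 might fill the queue. The Selenium import, the Firefox driver, and the placeholder start URL are illustrative assumptions, not part of the original setup:

    import scrapy
    from selenium import webdriver

    import globals


    class Spider1(scrapy.Spider):
        name = "spider1"
        start_urls = ["https://example.com"]  # placeholder URL

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # hypothetical: open a webdriver and publish its session_id,
            # which Spider2's queue.get() will later receive
            self.driver = webdriver.Firefox()
            globals.queue.put(self.driver.session_id)

        def parse(self, response):
            ...

    Since queue.get() blocks, Spider2's constructor waits until the id is available; and because the crawls run sequentially via inlineCallbacks, the item is already on the queue by the time Spider2 is created.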