python scrapy twisted pytest

Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice


I have some code which looks something like this:

from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()  # blocks until the crawl finishes
    return True

I have two py.test tests that each call run(). When the second test executes, I get the following error:

    runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
            Method called when reactor starts: do some initialization and fire
            startup events.

            Don't call this directly, call reactor.run() instead: it should take
            care of calling this.

            This method is somewhat misnamed.  The reactor will not necessarily be
            in the running state by the time this method returns.  The only
            guarantee is that it will be on its way to the running state.
            """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable

I understand that the reactor has already been run once, so I cannot call runner.start() again when the second test runs. Is there some way to reset its state between the tests, so that they are more isolated and can actually run one after another?


Solution

  • If you use CrawlerRunner instead of CrawlerProcess in conjunction with pytest-twisted, you should be able to run your tests like this:

    Install the Twisted integration for pytest: pip install pytest-twisted

    from scrapy.crawler import CrawlerRunner
    
    def _run_crawler(spider_cls, settings):
        """
        spider_cls: Scrapy Spider class
        settings: Scrapy settings
        returns: Twisted Deferred
        """
        runner = CrawlerRunner(settings)
        return runner.crawl(spider_cls)     # return Deferred
    
    
    def test_scrapy_crawler():
        deferred = _run_crawler(MySpider, settings)
    
        @deferred.addCallback
        def _success(results):
            """
            After crawler completes, this function will execute.
            Do your assertions in this function.
            """
    
        @deferred.addErrback
        def _error(failure):
            raise failure.value
    
        return deferred
    

    To put it plainly, _run_crawler() schedules a crawl in the Twisted reactor, and the callbacks run when the crawl completes. Those callbacks (_success() and _error()) are where you do your assertions. Lastly, the test function has to return the Deferred it got from _run_crawler() so that the test waits until the crawl is complete. Returning the Deferred is essential and must be done in every test.
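
    As for why the original version fails: the traceback in the question shows startRunning() checking two flags, _started and _startedBefore. The sketch below is a toy model of that check, not Twisted's actual code. ToyReactor and run_crawl are hypothetical names, and it assumes that stopping the reactor clears _started but never clears _startedBefore. Since each CrawlerProcess uses the same global reactor, the second call trips the second check.

```python
class ReactorAlreadyRunning(Exception):
    """Stand-in for twisted.internet.error.ReactorAlreadyRunning."""


class ReactorNotRestartable(Exception):
    """Stand-in for twisted.internet.error.ReactorNotRestartable."""


class ToyReactor:
    """Toy model of the two flags seen in the traceback (an assumption,
    not Twisted's real implementation)."""

    def __init__(self):
        self._started = False
        self._startedBefore = False

    def run(self):
        # Same two checks as in the startRunning() excerpt in the question.
        if self._started:
            raise ReactorAlreadyRunning()
        if self._startedBefore:
            raise ReactorNotRestartable()
        self._started = True
        self._startedBefore = True

    def stop(self):
        # Assumed behaviour: only _started is cleared; _startedBefore stays set.
        self._started = False


# One shared instance, mirroring the single global reactor per process.
reactor = ToyReactor()


def run_crawl():
    """Hypothetical stand-in for run() in the question."""
    reactor.run()
    reactor.stop()
    return True


first = run_crawl()  # first call succeeds

try:
    run_crawl()  # second call reuses the same reactor
    second_error = None
except ReactorNotRestartable as exc:
    second_error = exc
```

    Under that model there is nothing to reset between tests inside one process, which is why the approach above leaves starting the reactor to pytest-twisted instead of calling start() in each test.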

    Here's an example of how to run multiple crawls and aggregate results using gatherResults.

    from twisted.internet import defer
    
    def test_multiple_crawls():
        d1 = _run_crawler(Spider1, settings)
        d2 = _run_crawler(Spider2, settings)
    
        d_list = defer.gatherResults([d1, d2])
    
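        @d_list.addCallback
        def _success(results):
            # results holds one entry per crawl, in the order passed to gatherResults
            assert len(results) == 2
    
        @d_list.addErrback
        def _error(failure):
            # Re-raise so the test fails with the underlying error
            raise failure.value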
    
        return d_list
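
    If you have to keep CrawlerProcess, a commonly used alternative (not the approach shown above) is to give every crawl its own Python process, since a new process starts with a fresh reactor. Here is a minimal sketch using only the standard library. run_isolated and SNIPPET are hypothetical names, and the snippet is a stand-in for a script that builds a CrawlerProcess and calls start():

```python
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run a Python snippet in a fresh interpreter (hypothetical helper).

    Each call starts a new process, so nothing created by the snippet,
    including a Twisted reactor, carries over to the next call.
    Returns (exit_code, stdout).
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


# Stand-in for the real crawl code; in practice this would create a
# CrawlerProcess, call crawl() and then start(), like run() in the question.
SNIPPET = "print('crawl finished')"

first = run_isolated(SNIPPET)
second = run_isolated(SNIPPET)  # runs in a new process with fresh state
```

    The trade-off is that assertions can only look at what the child process reports back (exit code, output, or files it wrote), not at in-memory results.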
    

    I hope this helps. If it doesn't, please ask about the part where you're struggling.