python, scrapy, twisted, reactor

Running Scrapy periodically from python results in ReactorAlreadyRunning


After a few hours of tinkering and trying out snippets I found on Stack Overflow, I finally managed to run Scrapy periodically:

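# Imports assumed for the old (pre-1.0) Scrapy API used in this snippet.
from multiprocessing import Process

from scrapy import project, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, task

# MarketSpider is the project's own spider class (its import is not shown).

timeout = 60.0  # seconds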

class UrlCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)

        if not hasattr(project, 'crawler'):
            self.crawler.install()
            self.crawler.configure()
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()


def run_spider():
    spider = MarketSpider()
    crawler = UrlCrawlerScript(spider)
    crawler.start()
    crawler.join()
    print 'finished'


l = task.LoopingCall(run_spider)
l.start(timeout) # call every sixty seconds

reactor.run()

My problem is that I still get ReactorAlreadyRunning after the second run. How can I fix this?


Solution

  • Notice that your program calls reactor.run in two places, and one of them is called repeatedly, effectively in a loop, because LoopingCall invokes it indirectly (through run_spider).

    Twisted's reactors are not restartable. You can run and stop a reactor once. If you try to run it again after it has stopped, you get an exception (ReactorNotRestartable). If you try to run it while it is already running, you get a different exception, ReactorAlreadyRunning, as you have seen.

    The solution here is to run the reactor only once. Consequently, you should stop the reactor only once as well.

    At a minimum, this means you should call reactor.run from only one place in your program. As a start, I suggest keeping the call at the very end of the program and removing the one inside the run method (which is called once each time you run the spider).

    You also need to avoid stopping the reactor when the spider is done. If you connect reactor.stop to spider_closed, then after the spider runs the first time the reactor will stop and you won't be able to run the spider again. I think you can simply delete this part of your program.