Scrapy ReactorNotRestartable with a Certain Reactor


I am trying to schedule Scrapy crawls with Celery and ran into the common ReactorNotRestartable error. These past threads discuss it:

  • ReactorNotRestartable - Twisted and scrapy
  • Scrapy - Reactor not Restartable

The library that I am using requires twisted.internet.asyncioreactor.AsyncioSelectorReactor instead of the default reactor. If I follow the examples from those threads, my code fails because the requested reactor doesn't match the installed reactor. I've tried modifying the code to use the proper reactor, but I still get the same mismatch exception.

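from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet.asyncioreactor import AsyncioSelectorReactor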

def run_spider(spider, domain=None, check=None):
    def f(q):
        try:
            configure_logging()
            runner = CrawlerRunner(get_project_settings())
            deferred = runner.crawl(spider, domain=domain, check=check)
            deferred.addBoth(lambda _: reactor.stop())
            reactor = AsyncioSelectorReactor()
            reactor.run()
            q.put(None)
        except Exception as e:
            print("EXCEPTION!")
            q.put(e)
    
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/code/scrapy_parsing/scripts/run_spider.py", line 203, in crawl
    yield runner.crawl(spider)
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 232, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 266, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 271, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 103, in __init__
    verify_installed_reactor(reactor_class)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/reactor.py", line 138, in verify_installed_reactor
    raise Exception(msg)
Exception: The installed reactor (twisted.internet.epollreactor.EPollReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

Solution

  • In the body of your function you have:

                reactor = AsyncioSelectorReactor()
                reactor.run()
    

    Generally, you're not supposed to instantiate a Twisted reactor yourself, whether like this or in any other way. Instead, call the reactor module's install function before anything imports twisted.internet.reactor:

    from twisted.internet import asyncioreactor
    asyncioreactor.install()
    from twisted.internet import reactor
    

    I'm guessing you created a new reactor inside f because you want to start and stop a reactor multiple times, once per crawl that Scrapy runs. Twisted doesn't support this: once a reactor has been stopped it cannot be started again, which is exactly what ReactorNotRestartable reports.

    Running f in a multiprocessing.Process is presumably an attempt to get around this limitation, since each call then runs in a new process where no reactor has been started or stopped before. I wouldn't expect every part of Twisted to work properly when driven by multiprocessing.Process. Process management is complex and there hasn't been much effort to make the two work well together. Some limited subset of functionality may work, but you'll probably have to discover what that subset is yourself.

    Complicating all of this is the fact that scrapy itself also uses Twisted so you have Twisted being used in the parent process and the child processes. Because of this, it's not clear to me whether the child process inherits a reactor initialized in the parent process or if it will be able to start a brand new one unencumbered by what the parent has done.
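
One way to answer the inheritance question empirically is to look at sys.modules from inside the child: Twisted registers the installed reactor under the module name twisted.internet.reactor and refuses to install a second one while that entry exists. The sketch below is only a diagnostic aid. The helper names (reactor_already_installed, child_inherits_reactor, _report) are hypothetical, not part of Twisted or Scrapy.

```python
import multiprocessing
import sys


def reactor_already_installed():
    """Return True if a Twisted reactor is already installed in this process."""
    # Twisted stores the installed reactor in sys.modules under this name.
    return "twisted.internet.reactor" in sys.modules


def _report(queue):
    # Runs in the child process and reports what it sees.
    queue.put(reactor_already_installed())


def child_inherits_reactor(start_method):
    """Start a child with the given start method ("fork" or "spawn") and
    report whether a reactor is already installed there."""
    ctx = multiprocessing.get_context(start_method)
    queue = ctx.Queue()
    process = ctx.Process(target=_report, args=(queue,))
    process.start()
    inherited = queue.get()
    process.join()
    return inherited
```

Calling child_inherits_reactor("fork") and child_inherits_reactor("spawn") from the parent, after it has finished its own imports, shows whether each start method gives the child a clean slate. A forked child starts with a copy of the parent's sys.modules, so it sees any reactor the parent had already imported. A spawned child starts a fresh interpreter, although it may re-import the parent's main module.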

    If the child process is really a clean slate, then you should make the top lines of f do the reactor installation I suggested above:

    from twisted.internet import asyncioreactor
    asyncioreactor.install()
    from twisted.internet import reactor
    

    If it is not a clean slate then you could make the parent process also use this reactor, by putting the exact same lines at the top of your overall program. However, this probably just gets you back to ReactorNotRestartable or something similar because the reactor the parent installs/starts will get in the way of installing/starting a reactor in the child process.

    The even more general answer is that when using Twisted, you should structure your whole program so that the reactor is started and stopped only once, with any loops running between those two events. This is sometimes difficult when integrating Twisted into an existing application. https://pypi.org/project/crochet/ can make it easier: instead of restructuring the whole program, you push all of the Twisted-based activity into a separate thread where the reactor is started and stopped just once.
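
A rough sketch of the crochet approach follows. The factory make_blocking_crawl is a hypothetical helper, while setup and wait_for are crochet's own entry points. Installing the asyncio reactor before calling setup() is an assumption on my part that I have not verified with Scrapy running in crochet's reactor thread, so treat this as a starting point rather than a tested recipe.

```python
def make_blocking_crawl(timeout=600.0):
    """Build a blocking crawl(spider, **kwargs) function.

    Hypothetical helper; call it once at program startup, because the
    reactor can only be installed once per process.
    """
    # Assumption: install the asyncio reactor before crochet or Scrapy
    # import twisted.internet.reactor and pick the default one.
    from twisted.internet import asyncioreactor
    asyncioreactor.install()

    from crochet import setup, wait_for
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    setup()  # starts the reactor in a background thread

    @wait_for(timeout=timeout)
    def crawl(spider, **kwargs):
        # Runs in the reactor thread; the caller blocks until the returned
        # Deferred fires or the timeout expires.
        runner = CrawlerRunner(get_project_settings())
        return runner.crawl(spider, **kwargs)

    return crawl
```

With this shape, a task would call crawl = make_blocking_crawl() once when the worker starts and then crawl(MySpider, domain=...) for each job, so the reactor is never stopped or restarted between crawls.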