Tags: python, flask, scrapy, reactor, twisted.internet

Python scrapy ReactorNotRestartable substitute


I have been trying to make an app in Python using Scrapy that has the following functionality:

  • A REST API (which I built using Flask) listens for all crawl/scrape requests and returns the response after crawling. (The crawling part is short enough that the connection can be kept alive until crawling completes.)

I am able to do this using the following code:

# imports needed by this snippet (the code itself lives inside the Flask view function)
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

items = []
def add_item(item):
    items.append(item)

# set up the crawler and collect scraped items via the item_passed signal
crawler = Crawler(SpiderClass, settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)

# This is added to make the reactor stop; if I don't use this, the code gets stuck at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) #@UndefinedVariable
crawler.crawl(requestParams=requestParams)
# start crawling
reactor.run() #@UndefinedVariable
return str(items)

Now the problem I am facing is that after making the reactor stop (which seems necessary to me, since I don't want to stay stuck at reactor.run()), I can't accept any further requests. After the first request completes, I get the following error:

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

Which is obvious, since we cannot restart the reactor.

So my questions are:

1) How could I provide support for subsequent crawl requests?

2) Is there any way to move past the reactor.run() line without stopping the reactor?


Solution

  • Here is a simple solution to your problem:

    from flask import Flask
    import threading
    import subprocess
    import sys

    app = Flask(__name__)

    class myThread(threading.Thread):
        """Runs the given target callable in its own thread."""
        def __init__(self, target):
            threading.Thread.__init__(self)
            self.target = target
        def run(self):
            self.target()

    def start_crawl():
        # Run the crawl in a separate Python process so that a fresh Twisted
        # reactor is started (and torn down) on every request.
        process = subprocess.Popen([sys.executable, "start_request.py"])
        # Wait for the subprocess so this thread stays alive while the crawl runs.
        process.wait()

    @app.route("/crawler/start")
    def start_req():
        print(":request")
        threadObj = myThread(start_crawl)
        threadObj.start()
        return "Your crawler is in running state"

    if __name__ == "__main__":
        app.run(port=5000)
    

    In the above solution I assume that you are able to start your crawler from the shell/command line by running the start_request.py file.
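
    For reference, here is a minimal sketch (not part of the original answer) of what such a start_request.py could look like, using Scrapy's CrawlerProcess; the project module and spider class names are placeholders for your own:

        # start_request.py -- standalone crawl script (module/spider names are placeholders)
        from scrapy.crawler import CrawlerProcess
        from scrapy.utils.project import get_project_settings

        from myproject.spiders.my_spider import MySpider  # replace with your spider class

        if __name__ == "__main__":
            process = CrawlerProcess(get_project_settings())
            process.crawl(MySpider)
            # start() blocks until the crawl finishes; that is fine here because
            # the whole process exits afterwards, so the reactor is never reused
            process.start()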

    Now what we are doing is using Python's threading module to launch a new thread for each incoming request, so you can run a crawler instance in parallel for each hit. Just control the number of threads using threading.activeCount().
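
    For example, here is a rough sketch (MAX_CRAWLS is an assumed limit, not part of the original answer) of how the route above could refuse new crawls once too many crawl threads are alive:

        MAX_CRAWLS = 5  # assumed upper bound on concurrent crawls

        @app.route("/crawler/start")
        def start_req():
            # threading.activeCount() includes the main thread, so subtract it
            # to get a rough count of crawl threads currently running
            if threading.activeCount() - 1 >= MAX_CRAWLS:
                return "Too many crawlers running, please try again later"
            threadObj = myThread(start_crawl)
            threadObj.start()
            return "Your crawler is in running state"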