python, scrapy, tornado

How to make a scrapy spider run multiple times from a tornado request


I have a Scrapy spider that I need to run whenever a Tornado GET request is made. The first time I call the Tornado endpoint, the spider runs fine, but when I make another request, the spider does not run and the following error is raised:

Traceback (most recent call last):
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
        result = method(*self.path_args, **self.path_kwargs)
    File "server.py", line 38, in get
        process.start()
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
        reactor.run(installSignalHandlers=False)  # blocking call
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
        ReactorBase.startRunning(self)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
        raise error.ReactorNotRestartable()
ReactorNotRestartable

The Tornado handler is:

import json

import tornado.web
from scrapy.crawler import CrawlerProcess


class PageHandler(tornado.web.RequestHandler):

    def get(self):

        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        process.crawl(YourSpider)
        process.start()  # blocking call: starts the Twisted reactor

        self.write(json.dumps(results))

So the idea is that every time this handler is called, the spider runs and performs the crawl.
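
The bottom of the traceback points at the root cause: process.start() runs the Twisted reactor, and a Twisted reactor cannot be started a second time in the same process. A minimal, Scrapy-free sketch of the same failure (for illustration only):

from twisted.internet import error, reactor

# First run: schedule an immediate stop so reactor.run() returns.
reactor.callLater(0, reactor.stop)
reactor.run()

# Second run: the reactor refuses to start again in the same process.
try:
    reactor.run()
except error.ReactorNotRestartable:
    print("the reactor cannot be restarted")

This is exactly what happens on the second Tornado request: the handler tries to start a reactor that has already run and stopped.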


Solution

  • Well, after a lot of googling, I finally found the answer to this problem... There is a library, scrapydo (https://github.com/darkrho/scrapydo), that is based on crochet: it runs the Twisted reactor for you and blocks until each crawl finishes, so the same spider can be run again whenever it is needed.

    So to solve the problem you need to install the library (e.g. pip install scrapydo), call the setup method once, and then use the run_spider method... The code looks like this:

    import json

    import scrapydo
    import tornado.web

    # setup() must be called once, before any requests are handled.
    scrapydo.setup()


    class PageHandler(tornado.web.RequestHandler):

        def get(self):

            # run_spider blocks (via crochet) until the crawl finishes,
            # and can safely be called again on every request.
            scrapydo.run_spider(YourSpider(), settings={
                'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
                'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
            })

            self.write(json.dumps(results))
    

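    For the snippet above to run, it also needs YourSpider, the ResultsPipeline that fills the results list, and the Tornado application itself; their definitions are not shown above. A minimal sketch of those pieces, in the same module, with a placeholder spider and URL, could look like:

    import scrapy
    import tornado.ioloop


    results = []


    class YourSpider(scrapy.Spider):
        name = 'your_spider'
        start_urls = ['http://example.com']  # placeholder URL

        def parse(self, response):
            yield {'title': response.xpath('//title/text()').extract_first()}


    class ResultsPipeline(object):
        """Collects every scraped item into the module-level results list."""

        def process_item(self, item, spider):
            results.append(dict(item))
            return item


    if __name__ == '__main__':
        application = tornado.web.Application([(r'/crawl', PageHandler)])
        application.listen(8888)
        tornado.ioloop.IOLoop.current().start()

    (In a real application you would probably clear results before each crawl, since the pipeline keeps appending to the same list.)
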
    Hope this helps anyone who has the same problem!