I have a Scrapy spider that I need to run whenever a Tornado GET request is handled. The first time I call the Tornado endpoint, the spider runs fine, but on any subsequent request the spider does not run and the following error is raised:
Traceback (most recent call last):
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "server.py", line 38, in get
    process.start()
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
The Tornado handler is:
class PageHandler(tornado.web.RequestHandler):
    def get(self):
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })
        process.crawl(YourSpider)
        process.start()  # starts the Twisted reactor -- a blocking call
        self.write(json.dumps(results))  # 'results' is filled by ResultsPipeline
So the idea is that every time the PageHandler's get method is called, the spider runs and performs the crawl.
Well, after a lot of googling, I finally found the answer to this problem... The root cause is that Twisted's reactor can only be started once per process, so the second call to process.start() raises ReactorNotRestartable. There is a library, scrapydo (https://github.com/darkrho/scrapydo), based on crochet, that blocks on the reactor for you, allowing you to reuse the same spider as many times as needed.
So to solve the problem, install the library, call the setup method once, and then use the run_spider method... The code looks like this:
import scrapydo

scrapydo.setup()  # call once, before any run_spider() call

class PageHandler(tornado.web.RequestHandler):
    def get(self):
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })
        self.write(json.dumps(results))  # 'results' is filled by ResultsPipeline
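For reference, here is a minimal, self-contained sketch of how the whole script might fit together. YourSpider, ResultsPipeline, the module-level results list, and the example.com URL are illustrative assumptions, not part of scrapydo's API; check the scrapydo README for the exact run_spider signature:

import json

import scrapy
import scrapydo
import tornado.ioloop
import tornado.web

scrapydo.setup()  # starts the reactor in a crochet-managed thread; call once

results = []  # hypothetical shared list, filled by the pipeline below

class ResultsPipeline(object):
    """Collects every scraped item into the module-level 'results' list."""
    def process_item(self, item, spider):
        results.append(dict(item))
        return item

class YourSpider(scrapy.Spider):
    """Placeholder spider; replace with your actual spider."""
    name = 'your_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').extract_first()}

class PageHandler(tornado.web.RequestHandler):
    def get(self):
        del results[:]  # reset between requests
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })
        self.write(json.dumps(results))

if __name__ == '__main__':
    application = tornado.web.Application([(r'/', PageHandler)])
    application.listen(8888)
    tornado.ioloop.IOLoop.current().start()

Note that run_spider blocks the handler (and therefore Tornado's IOLoop) until the crawl finishes, which matches the original blocking behaviour but will stall other requests in the meantime.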
Hope this helps anyone who has the same problem!