Before you link me to other answers related to this, note that I've read them and am still a bit confused. Alrighty, here we go.
So I am creating a webapp in Django. I am importing the newest scrapy library to crawl a website. I am not using celery (I know very little about it, but saw it in other topics related to this).
One of the URLs of our website, /crawl/, is meant to start the crawler running. It's the only URL on our site that requires scrapy. Here is the function which is called when the URL is visited:
def crawl(request):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(ReviewSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
    return render(request, 'index.html')
You'll notice that this is an adaptation of the scrapy tutorial on their website. The first time this URL is visited after the server starts, everything works as intended. On the second and every later visit, a ReactorNotRestartable exception is thrown. I understand that this exception happens when a reactor which has already been stopped is told to start again, which is not possible.
Looking at the sample code, I would assume the line "runner = CrawlerRunner()" would return a ~new~ reactor for use each time this url is visited. But I believe perhaps my understanding of twisted reactors is not completely clear.
How would I go about getting and running a NEW reactor each time this url is visited?
Thank you so much
Generally speaking, you can't have a new reactor. There's one global one. This is clearly a mistake and maybe it will be corrected in the future but that's the current state of affairs.
You might be able to use Crochet to manage a single reactor running (for the lifetime of your whole process - not repeatedly starting and stopping) in a separate thread.
Consider the example from the Crochet docs:
#!/usr/bin/python
"""
Do a DNS lookup using Twisted's APIs.
"""
from __future__ import print_function

# The Twisted code we'll be using:
from twisted.names import client

from crochet import setup, wait_for
setup()


# Crochet layer, wrapping Twisted's DNS library in a blocking call.
@wait_for(timeout=5.0)
def gethostbyname(name):
    """Lookup the IP of a given hostname.

    Unlike socket.gethostbyname() which can take an arbitrary amount of time
    to finish, this function will raise crochet.TimeoutError if more than 5
    seconds elapse without an answer being received.
    """
    d = client.lookupAddress(name)
    d.addCallback(lambda result: result[0][0].payload.dottedQuad())
    return d


if __name__ == '__main__':
    # Application code using the public API - notice it works in a normal
    # blocking manner, with no event loop visible:
    import sys
    name = sys.argv[1]
    ip = gethostbyname(name)
    print(name, "->", ip)
This gives you a blocking gethostbyname function that's implemented using Twisted APIs. The implementation uses twisted.names.client, which just relies on being able to import the global reactor.
Note there is no reactor.run or reactor.stop call - just the Crochet setup call.