I'm writing a crawler using Twisted and its deferredGenerator. Here is the code relevant to my question:
from twisted.internet import defer, reactor
from twisted.web.client import getPage

@defer.deferredGenerator
def getReviewsFromPage(self, title, params):
    def deferred1(page):
        # Parse the page now, but deliver the result one second later.
        d = defer.Deferred()
        reactor.callLater(1, d.callback, self.parseReviewJson(page))
        return d
    def deferred2(dataL, title):
        # Write the parsed data to CSV, delivering the result one second later.
        d = defer.Deferred()
        reactor.callLater(1, d.callback, self.writeToCSV(dataL, title=title))
        return d
    cp = 1
    #for cp in range(1, 15000):
    while self.running:
        print cp
        params["currentPageNum"] = cp
        url = self.generateReviewUrl(self.urlPrefix, params=params)
        print url
        wfd = defer.waitForDeferred(getPage(url, timeout=10))
        yield wfd
        page = wfd.getResult()
        wfd = defer.waitForDeferred(deferred1(page))
        yield wfd
        dataList = wfd.getResult()
        wfd = defer.waitForDeferred(deferred2(dataList, title))
        yield wfd
        cp += 1
And I start the generator with:

self.getReviewsFromPage(title, params)
reactor.run()
My question is: when getPage hits a timeout error, how can I handle the error and crawl the failed page again? I added an addErrback to getPage once and wanted to call getPage again from there, but it seemed that once the reactor was running, it would not receive any new events.
Has anyone run into the same problem? I'd appreciate your help.
"it seems that when reactor is running, it won't receive new event any more."
This isn't the case. Events only happen when the reactor is running!
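For example, here is a minimal sketch (the function name sayHello is made up for illustration) showing that an event scheduled with callLater is delivered precisely because the reactor is running:

from twisted.internet import reactor

def sayHello():
    # This runs one second after reactor.run() starts,
    # because the running reactor dispatches the event.
    print "the reactor delivered this event"
    reactor.stop()

reactor.callLater(1, sayHello)
reactor.run()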
You didn't share the version of the code that uses addErrback, so I can't see if there was a problem in how you were using it. However, since you're already using deferredGenerator, a more idiomatic approach would be:
from twisted.internet.error import TimeoutError

numRetries = 3  # or however many attempts you want to allow

page = None
for i in range(numRetries):
    wfd = defer.waitForDeferred(getPage(url, timeout=10))
    yield wfd
    try:
        page = wfd.getResult()
    except TimeoutError:
        # Do nothing, let the loop continue
        pass
    else:
        # Success, exit the loop
        break

if page is None:
    # Handle the timeout for real
    ...
else:
    # Continue processing
    ...
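For completeness, since you asked about addErrback: a retry can also be expressed by chaining a new getPage call from the errback. This is only a sketch; the helper name getPageWithRetry and its retry count are assumptions, not part of your code:

from twisted.internet.error import TimeoutError
from twisted.web.client import getPage

def getPageWithRetry(url, retriesLeft=3):
    # Hypothetical helper: fetch url, retrying on timeout.
    d = getPage(url, timeout=10)
    def retryOnTimeout(failure):
        # trap() re-raises anything that is not a TimeoutError.
        failure.trap(TimeoutError)
        if retriesLeft <= 0:
            return failure  # out of attempts; propagate the failure
        return getPageWithRetry(url, retriesLeft - 1)
    d.addErrback(retryOnTimeout)
    return d

Returning a new Deferred from the errback makes the original Deferred wait for the retried request, so callers never notice a retry happened. Either way, the key point stands: the reactor keeps dispatching events while it runs; the retry just has to be expressed as another Deferred in the chain.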