I'm running Scrapy as an AWS Lambda function. Inside my function I need a timer that checks whether it has been running longer than 1 minute and, if so, runs some logic. Here is my code:
def handler():
    x = 60
    watchdog = Watchdog(x)
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()
I took the Watchdog timer class from this answer. The problem is that the code never reaches the except Watchdog block; instead, the exception is thrown outside of it:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
    raise self
functions.python.my_scrapy.index.Watchdog: 1
I need to catch the exception inside the function. How would I go about that? PS: I'm very new to Python.
Alright, this question had me going a little crazy. Here is why that doesn't work:
The Watchdog object creates another thread (a timer thread), and the exception is raised in that thread, where nothing handles it. Your try/except only catches exceptions raised in the main thread, so it never sees this one. Luckily, Twisted has some neat features.
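To illustrate the point, here is a minimal sketch using only the standard library. The `Watchdog` class below is a hypothetical stand-in (a bare exception), not the class from the linked answer, and the short delays are chosen just to keep the demo quick. The exception is raised inside the `threading.Timer` thread, so the `except` block in the main thread never runs; Python simply prints a traceback for the timer thread to stderr.

```python
import threading
import time


class Watchdog(Exception):
    """Hypothetical stand-in for the Watchdog exception in the question."""


def raise_in_timer_thread():
    # This runs in the Timer's own thread, not in the main thread.
    raise Watchdog("timed out")


caught_in_main = False
timer = threading.Timer(0.1, raise_in_timer_thread)
try:
    timer.start()
    time.sleep(0.3)  # simulate long-running work in the main thread
except Watchdog:
    # Never reached: the exception lives and dies in the timer thread.
    caught_in_main = True
timer.join()

print(caught_in_main)  # False
```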
You can do it by running the reactor in another thread:
import time
from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
# Run the reactor in a different thread so it doesn't block the script here.
# args=(False,) skips installing signal handlers, which only works in the main thread.
Thread(target=reactor.run, args=(False,)).start()
time.sleep(60)  # block the script here for 60 seconds
# Now check if it's still scraping
if reactor.running:
    pass  # do something
else:
    pass  # do something else
I'm using Python 3.7.0.
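One thing to note about the approach above: `time.sleep(60)` always waits the full minute, even if the crawl finishes early. A possible variation, sketched here with the standard library only, is to keep a reference to the thread and call `join` with a timeout, then check `is_alive()`. The names `run_with_timeout`, `quick_job`, and `slow_job` are made up for this illustration; in the real script the thread target would be `reactor.run` with `args=(False,)` as above.

```python
import threading
import time


def run_with_timeout(target, timeout):
    """Run target in a background thread.

    Returns True if target finished within `timeout` seconds, False otherwise.
    """
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)  # returns as soon as the work is done, or after timeout
    return not worker.is_alive()


def quick_job():
    # Stand-in for a crawl that finishes in time.
    time.sleep(0.05)


def slow_job():
    # Stand-in for a crawl that runs past the limit.
    time.sleep(1.0)


finished_quick = run_with_timeout(quick_job, timeout=0.5)
finished_slow = run_with_timeout(slow_job, timeout=0.2)

print(finished_quick)  # True
print(finished_slow)   # False
```

If the timeout branch needs to stop the crawl, Twisted provides `reactor.callFromThread(reactor.stop)` for requesting that safely from a thread other than the one running the reactor.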