Tags: python, scrapy, twisted, python-multithreading

AWS lambda, scrapy and catching exceptions


I'm running scrapy as an AWS Lambda function. Inside my function I need a timer to check whether it has been running longer than 1 minute, and if so, run some logic. Here is my code:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

def handler():
    x = 60
    watchdog = Watchdog(x)  # raises a Watchdog exception after x seconds
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()

I took the Watchdog timer class from this answer. The problem is that the code never hits the except Watchdog block; instead, the exception is thrown outside of it:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
    raise self
functions.python.my_scrapy.index.Watchdog: 1
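For reference, the Watchdog class from that answer is roughly a threading.Timer wrapper like the sketch below. This is a reconstruction based on the traceback above (which shows defaultHandler doing raise self); the exact details may differ:

from threading import Timer

class Watchdog(Exception):
    def __init__(self, timeout, userHandler=None):  # timeout in seconds
        self.timeout = timeout
        self.handler = userHandler if userHandler is not None else self.defaultHandler
        self.timer = Timer(self.timeout, self.handler)
        self.timer.start()

    def stop(self):
        self.timer.cancel()

    def defaultHandler(self):
        raise self  # runs in the Timer's worker thread, not the main thread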

I need to catch the exception in the function. How would I go about that? P.S.: I'm very new to Python.


Solution

  • Alright, this question had me going a little crazy; here is why that doesn't work:

    What the Watchdog object does is create another thread (a threading.Timer) and raise the exception there. An exception can only be caught in the thread that raises it, and your try/except lives in the main thread, so it never sees the Watchdog. Luckily, Twisted has some neat features that help here.
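    A minimal illustration of that pitfall, independent of scrapy (the names here are illustrative, not from the original code):

    import threading

    def boom():
        raise RuntimeError('raised in a worker thread')

    try:
        t = threading.Thread(target=boom)
        t.start()
        t.join()  # join() returns normally even though the worker died
    except RuntimeError:
        print('never reached: this except clause runs in the main thread')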

    You can work around it by running the reactor in another thread and keeping the timer in the main thread:

    import time
    from threading import Thread
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    # Run the reactor in another thread so it doesn't block the script here.
    # The positional False is installSignalHandlers=False, required because
    # signal handlers can only be installed in the main thread.
    Thread(target=reactor.run, args=(False,)).start()

    time.sleep(60)  # block the main thread here instead

    # reactor.stop() fires once both crawls have finished, so reactor.running
    # tells us whether the spiders are still scraping
    if reactor.running:
        pass  # do something: still scraping after 60 seconds
    else:
        pass  # do something else: the crawl already finished
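
    One way to fill in the timeout branch (my addition, assuming you want to abort a crawl that overruns): reactor.stop() may only be called from the reactor's own thread, so schedule it with reactor.callFromThread rather than calling it directly from the main thread:

    if reactor.running:
        # Crawl exceeded the timeout: schedule a clean shutdown on the
        # reactor thread (reactor.stop() is not safe to call directly
        # from another thread).
        reactor.callFromThread(reactor.stop)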
    

    I'm using Python 3.7.0.