Tags: python, scrapy, twisted

How to provide a twisted reactor which is currently running?


We are trying to program a bot that crawls articles from a newspaper via its RSS feeds. We want our script to repeat these steps several times per day:

1) look at the rss feeds we have listed

2) identify articles we haven't crawled yet

3) add the links to a list of urls to crawl

4) crawl the urls listed

We managed to execute these steps once with this code:

import sqlite3

import feedparser
from scrapy.crawler import CrawlerProcess

rss_feeds_lemonde = [
    'http://www.lemonde.fr/rss/une.xml',
    'http://www.lemonde.fr/international/rss_full.xml',
    'http://www.lemonde.fr/politique/rss_full.xml',
]

db = sqlite3.connect('newspaper_db')
cursor = db.cursor()
urls = []
site = 'lemonde'

for rss_feed in rss_feeds_lemonde:
    parsed_rss_feed = feedparser.parse(rss_feed)
    for post in parsed_rss_feed.entries:
        url = post.link
        if url.split('.')[1] == site:
            # Parameterized query: the original compared the column
            # `newspaper` to a non-existent column named `site`.
            cursor.execute('SELECT 1 FROM articles WHERE newspaper = ? AND url = ?', (site, url))
            if cursor.fetchone() is None:
                cursor.execute('INSERT INTO articles(url, newspaper) VALUES(?,?)', (url, site))
                urls.append(url)

db.commit()
cursor.close()
db.close()

if urls:
    process = CrawlerProcess()
    process.crawl(LeMondeSpider, start_urls=urls)
    process.start()

The problem is that the Twisted reactor is not restartable, so this only lets us execute our steps once. Is it possible to pause the reactor and resume it after we provide a new list of urls to crawl? Do we have other options?

[edit] For notorious.no: this example works fine now, thanks to you!

import time

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def run_when_crawl_done(null):
    # NB: time.sleep() blocks the whole reactor for 10 seconds;
    # reactor.callLater or task.deferLater would be the non-blocking way.
    time.sleep(10)
    urls = [
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/05/35003-20170505ARTFIG00129-comment-ils-veulent-bloquer-le-pen-sans-soutenir-macron-ce-dimanche.php',
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/04/35003-20170504ARTFIG00259-si-marine-le-pen-atteint-40-ca-serait-deja-une-enorme-victoire-dit-sa-niece.php',
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/04/35003-20170504ARTFIG00126-emmanuel-macron-non-je-n-ai-pas-de-compte-aux-bahamas.php',
    ]
    deferred = runner.crawl(LeFigaroSpider, start_urls=urls)
    deferred.addCallback(lambda _: reactor.stop())

urls = [
    'http://www.lemonde.fr/les-decodeurs/article/2017/04/26/europe-macron-emploi-la-trumpisation-de-marine-le-pen-sur-tf1_5117479_4355770.html',
    'http://www.lemonde.fr/syrie/article/2017/04/26/attaque-chimique-la-france-avance-ses-preuves-contre-damas_5117652_1618247.html',
]

if urls:
    configure_logging()
    runner = CrawlerRunner()
    deferred = runner.crawl(LeMondeSpider, start_urls=urls)
    deferred.addCallback(run_when_crawl_done)
    reactor.run()

Solution

  • Twisted's reactor is indeed unable to restart. If you think about it for a minute, you will realize that stopping an event loop, only to have another event start it back up, is counterintuitive. Most event-driven apps are "long running" and should not stop unless something is severely wrong.

    Do not start-stop-restart event loops. Start the app and then never restart it (you're making a bot, so I assume the bot never sleeps). Use CrawlerRunner instead of CrawlerProcess, then execute reactor.run(). This gives you a bit more flexibility and lets you run more tasks concurrently.

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner

    def run_when_crawl_done(null):
        """
        Logic that will be executed after the crawl is done.
        """

    if urls:
        runner = CrawlerRunner()
        deferred = runner.crawl(LeMondeSpider, start_urls=urls)
        deferred.addCallback(run_when_crawl_done)
        reactor.run()