Tags: python, scrapy, web-crawler, reactor

Scrapy crawl multiple times in long running process


So, I made this class so that I can crawl on-demand using Scrapy:

from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings


class NewsCrawler(object):

    def __init__(self, spiders=[]):
        self.spiders = spiders
        self.settings = Settings()

    def crawl(self, start_date, end_date):
        crawled_items = []

        def add_item(item):
            crawled_items.append(item)

        process = CrawlerProcess(self.settings)

        for spider in self.spiders:
            crawler = Crawler(spider, self.settings)
            # collect every scraped item via the item_scraped signal
            crawler.signals.connect(add_item, signals.item_scraped)
            process.crawl(crawler, start_date=start_date, end_date=end_date)

        # blocks until all crawls are finished and the reactor has stopped
        process.start()

        return crawled_items

Basically, I have a long-running process and I will call the above class's crawl method multiple times, like this:

import time


crawler = NewsCrawler(spiders=[Spider1, Spider2])

while True:
    items = crawler.crawl(start_date, end_date)
    # do something with crawled items ...
    time.sleep(3600)

The problem is, the second time crawl is called, this error occurs: twisted.internet.error.ReactorNotRestartable.

From what I gathered, it's because the reactor can't be restarted after it has been stopped. Is there any workaround for that?

Thanks!


Solution

  • This is a limitation of Scrapy (Twisted) at the moment, and it makes it hard to use Scrapy as a library.

    What you can do is fork a new process which runs the crawler and stops the reactor when the crawl is finished. You can then wait for the process to join, and spawn a new one after the crawl has finished. If you want to handle the items in your main process, you can post the results to a Queue (see the sketch after this answer). I would recommend using a customized pipeline for your items, though.

    Have a look at the following answer by me: https://stackoverflow.com/a/22202877/2208253

    You should be able to apply the same principles. In your case, though, you would use multiprocessing instead of billiard.
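
    For illustration, this is roughly how the fork-per-crawl idea could look with multiprocessing and a Queue, keeping the question's NewsCrawler interface. Treat it as a minimal, untested sketch: the Crawler / CrawlerProcess calls mirror the question's code, while the helper _run_crawl, the queue-draining loop, and the timeout value are my own assumptions rather than anything from the linked answer:

        import multiprocessing
        from queue import Empty

        from scrapy import signals
        from scrapy.crawler import CrawlerProcess, Crawler
        from scrapy.settings import Settings


        def _run_crawl(spiders, start_date, end_date, item_queue):
            # Runs entirely in a child process: build a fresh CrawlerProcess,
            # push every scraped item onto the queue, and let the process exit
            # (taking its reactor with it) when the crawl is done.
            settings = Settings()
            process = CrawlerProcess(settings)

            def add_item(item):
                item_queue.put(item)

            for spider in spiders:
                crawler = Crawler(spider, settings)
                crawler.signals.connect(add_item, signals.item_scraped)
                process.crawl(crawler, start_date=start_date, end_date=end_date)

            process.start()  # blocks until all spiders finish


        class NewsCrawler(object):

            def __init__(self, spiders=None):
                self.spiders = spiders or []

            def crawl(self, start_date, end_date):
                item_queue = multiprocessing.Queue()
                proc = multiprocessing.Process(
                    target=_run_crawl,
                    args=(self.spiders, start_date, end_date, item_queue),
                )
                proc.start()

                crawled_items = []
                # Drain the queue while the child is running so it never blocks
                # on a full queue, and keep draining until the child has exited.
                while proc.is_alive() or not item_queue.empty():
                    try:
                        crawled_items.append(item_queue.get(timeout=1))
                    except Empty:
                        pass

                proc.join()
                return crawled_items

    Each call to crawl() then gets a brand-new process, and therefore a brand-new reactor, so the hourly loop from the question can keep calling it without hitting ReactorNotRestartable.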