So, I made this class so that I can crawl on-demand using Scrapy:
from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings


class NewsCrawler(object):

    def __init__(self, spiders=[]):
        self.spiders = spiders
        self.settings = Settings()

    def crawl(self, start_date, end_date):
        crawled_items = []

        def add_item(item):
            crawled_items.append(item)

        process = CrawlerProcess(self.settings)

        for spider in self.spiders:
            crawler = Crawler(spider, self.settings)
            crawler.signals.connect(add_item, signals.item_scraped)
            process.crawl(crawler, start_date=start_date, end_date=end_date)

        process.start()

        return crawled_items
Basically, I have a long-running process, and I call the above class's crawl method multiple times, like this:
import time

crawler = NewsCrawler(spiders=[Spider1, Spider2])

while True:
    items = crawler.crawl(start_date, end_date)
    # do something with crawled items ...
    time.sleep(3600)
The problem is that the second time crawl is called, this error occurs: twisted.internet.error.ReactorNotRestartable.
From what I gathered, it's because the reactor can't be run again after it has been stopped. Is there any workaround for that?
Thanks!
This is a limitation of Scrapy (Twisted) at the moment, and it makes Scrapy hard to use as a library.
What you can do is fork a new process that runs the crawler and stops the reactor when the crawl is finished. You can then join that process and spawn a new one for the next crawl. If you want to handle the items in your main thread, you can post the results to a Queue. I would recommend using a custom item pipeline for your items, though.
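As a rough illustration of the pipeline route, here is a minimal sketch of an item pipeline that appends each item to a JSON Lines file. The class name JsonLinesExportPipeline and the default path items.jl are made up for this example, and it assumes the classic pipeline hooks (open_spider, close_spider, process_item) and items that convert cleanly with dict(); check the item pipeline docs for your Scrapy version.

```python
import json


class JsonLinesExportPipeline:
    """Hypothetical pipeline: append every scraped item to a JSON Lines file."""

    def __init__(self, path="items.jl"):  # illustrative default path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Assumes the item is a dict or can be turned into one.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to any later pipeline
```

You would enable it through the ITEM_PIPELINES setting, using whatever import path the class has in your project. Because the items are handled inside the crawling process, nothing has to be sent back to the parent.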
Have a look at this answer of mine: https://stackoverflow.com/a/22202877/2208253
You should be able to apply the same principles, although you would want to use multiprocessing rather than billiard.
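A minimal sketch of that idea with multiprocessing might look like the following. The helper names run_in_fresh_process and crawl_once are invented for this example, and it assumes a POSIX system where the "fork" start method is available and items that can be pickled. Every call runs in its own child process, so each crawl gets a reactor that has never been started.

```python
import multiprocessing as mp


def _child(queue, func, args, kwargs):
    # Runs in the child process: call func and send back its result or error.
    try:
        queue.put(("ok", func(*args, **kwargs)))
    except Exception as exc:
        queue.put(("error", repr(exc)))


def run_in_fresh_process(func, *args, **kwargs):
    """Run func in a forked child process and return its result."""
    ctx = mp.get_context("fork")  # assumption: POSIX; not available on Windows
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue, func, args, kwargs))
    proc.start()
    # Read the result before join() so a large payload cannot block the child.
    status, payload = queue.get()
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def crawl_once(spiders, start_date, end_date):
    """Hypothetical crawl function; Scrapy is only imported inside the child."""
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    items = []

    def add_item(item):
        items.append(item)  # items must be picklable to cross the process boundary

    process = CrawlerProcess(Settings())
    for spider in spiders:
        crawler = process.create_crawler(spider)
        crawler.signals.connect(add_item, signals.item_scraped)
        process.crawl(crawler, start_date=start_date, end_date=end_date)
    process.start()  # blocks until every crawl has finished
    return items
```

In the loop from the question, the call would then become something like `items = run_in_fresh_process(crawl_once, [Spider1, Spider2], start_date, end_date)`.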