
How to start a Scrapy spider from another one


I have two spiders in one Scrapy project. Spider1 crawls a list of pages or an entire website and analyzes the content. Spider2 uses Splash to fetch result URLs from Google and passes that list to Spider1.

So Spider1 crawls and analyzes content, and can be used on its own without being called by Spider2:

# coding: utf8
import scrapy


class Spider1(scrapy.Spider):
    name = "spider1"
    tokens = []
    query = ''

    def __init__(self, *args, **kwargs):
        '''
        This spider works with two modes,
        if only one URL it crawls the entire website,
        if a list of URLs only analyze the page
        '''
        super(Spider1, self).__init__(*args, **kwargs)
        start_url = kwargs.get('start_url') or ''
        start_urls = kwargs.get('start_urls') or []
        query = kwargs.get('q') or ''
        if query != '':
            self.query = query
        if start_url != '':
            self.start_urls = [start_url]
        if len(start_urls) > 0:
            self.start_urls = start_urls


    def parse(self, response):
        '''
        Analyze and store data
        '''
        if len(self.start_urls) == 1:
            for next_page in response.css('a::attr("href")'):
                yield response.follow(next_page, self.parse)

    def closed(self, reason):
        '''
        Finalize crawl
        '''

The code for Spider2:

# coding: utf8
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Spider2(scrapy.Spider):
    name = "spider2"
    urls = []
    page = 0

    def __init__(self, *args, **kwargs):
        super(Spider2, self).__init__(*args, **kwargs)
        self.query = kwargs.get('q') or ''
        self.url = kwargs.get('url')
        self.start_urls = ['https://www.google.com/search?q=' + self.query]

    def start_requests(self):
        splash_args = {
            'wait': 2,
        }
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args=splash_args)

    def parse(self, response):
        '''
        Extract URLs to self.urls
        '''
        self.page += 1

    def closed(self, reason):
        process = CrawlerProcess(get_project_settings())
        for url in self.urls:
            print(url)
        if len(self.urls) > 0:
            process.crawl('spider1', start_urls=self.urls, q=self.query)
            process.start(False)

When running Spider2 I get this error: twisted.internet.error.ReactorAlreadyRunning, and Spider1 is called without the list of URLs. I tried using CrawlerRunner as advised by the Scrapy documentation, but I get the same problem. I tried using CrawlerProcess inside the parse method; it "works", but I still get the error message. Using CrawlerRunner inside the parse method doesn't work at all.


Solution

  • Currently it is not possible to start a spider from another spider if you're using the scrapy crawl command (see https://github.com/scrapy/scrapy/issues/1226). It is possible to start a spider from another spider if you write a startup script yourself - the trick is to use the same CrawlerProcess/CrawlerRunner instance; see the sketch below.
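
    For illustration, here is a minimal sketch of such a startup script, assuming both spiders are registered in the same project; the urls.txt hand-off file is an assumption for the sketch, not something in your code:

    # run.py - one shared CrawlerRunner instance drives both crawls
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        # the crawls run sequentially on a single reactor,
        # so ReactorAlreadyRunning never comes up
        yield runner.crawl('spider2', q='some query')
        with open('urls.txt') as f:  # hypothetical hand-off file
            urls = [line.strip() for line in f if line.strip()]
        yield runner.crawl('spider1', start_urls=urls, q='some query')
        reactor.stop()

    crawl()
    reactor.run()  # blocks until both crawls finish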

    I'd not do that though; you're fighting against the framework. It'd be nice to support this use case, but it is not really supported now.

    An easier way is to either rewrite your code to use a single Spider class, or to create a script (bash, Makefile, luigi/airflow if you want to be fancy) which runs scrapy crawl spider1 -o items.jl followed by scrapy crawl spider2; the second spider can read items created by the first spider and generate start_requests accordingly, as in the sketch below.
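
    As a sketch of that second approach, spider2's start_requests could read the items file like this (the 'url' field name is an assumption about what spider1 exports):

    # spider2 side: consume items.jl written by `scrapy crawl spider1 -o items.jl`
    import json

    import scrapy

    class Spider2(scrapy.Spider):
        name = "spider2"

        def start_requests(self):
            # items.jl is JSON Lines: one serialized item per line
            with open('items.jl') as f:
                for line in f:
                    item = json.loads(line)
                    yield scrapy.Request(item['url'], callback=self.parse)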

    FTR: combining SplashRequest and regular scrapy.Request objects in a single spider is fully supported (it should just work), so you don't have to create separate spiders for them; a sketch follows.
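
    For example, a single spider along these lines could cover both steps (the selector and the wait value are illustrative):

    # one spider: Splash renders the search page, the result pages
    # are then fetched with regular scrapy.Requests
    import scrapy
    from scrapy_splash import SplashRequest

    class CombinedSpider(scrapy.Spider):
        name = "combined"

        def __init__(self, *args, **kwargs):
            super(CombinedSpider, self).__init__(*args, **kwargs)
            self.query = kwargs.get('q') or ''

        def start_requests(self):
            url = 'https://www.google.com/search?q=' + self.query
            yield SplashRequest(url, self.parse_search, args={'wait': 2})

        def parse_search(self, response):
            # plain Requests, no Splash, for the pages found in the results
            for href in response.css('a::attr(href)').getall():
                yield scrapy.Request(response.urljoin(href), self.parse_page)

        def parse_page(self, response):
            # analyze and store data, as in Spider1
            pass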