Tags: python, scrapy, scrapy-pipeline

How to run multiple spiders through individual pipelines?


Total noob just getting started with scrapy.

My directory structure looks like this...

#FYI: running on Scrapy 2.4.1
WebScraper/
  Webscraper/
     spiders/
        spider.py    # (NOTE: contains spider1 and spider2 classes.)
     items.py
     middlewares.py
     pipelines.py    # (NOTE: contains spider1_pipelines and spider2_pipelines)
     settings.py     # (NOTE: I wrote here:
                     #ITEM_PIPELINES = {
                     #  'WebScraper.pipelines.spider1_pipelines': 300,
                     #  'WebScraper.pipelines.spider2_pipelines': 300,
                     #} 
  scrapy.cfg

And spider.py resembles...

import scrapy


class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict with the keys spider1_pipelines expects
        yield {"header1": "...", "header2": "..."}


class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict with the keys spider2_pipelines expects
        yield {"header_a": "...", "header_b": "..."}

With pipelines.py looking like...

import csv


class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = [item['header1'], item['header2']]
        self.csvwriter.writerow(row)
        return item  # pipelines should return the item for downstream components


class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        row = [item['header_a'], item['header_b']]  # NOTE: not the same keys as spider1
        self.csvwriter.writerow(row)
        return item

I have a question about running spider1 and spider2 on different urls with one terminal command:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log

Note: this is an extension of a previous question, specific to this Stack Overflow post (2018).

Desired result: spider1.csv with data from spider1, spider2.csv with data from spider2.

Current result: spider1.csv with data from spider1; spider2.csv breaks, but the error log contains spider2's data and a KeyError: 'header1', even though spider2's item does not include header1 at all (it only includes header_a).

Does anyone know how to run one spider after the other on different urls, and route the data fetched by spider1, spider2, etc. into pipelines specific to that spider, i.e. spider1 -> spider1_pipelines -> spider1.csv and spider2 -> spider2_pipelines -> spider2.csv?

Or perhaps this is a matter of specifying a spider1_item and spider2_item in items.py? I wonder if I can specify where to insert spider2's data that way.
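
For example, a rough sketch of what I mean (the Item classes and field names here are hypothetical, just mirroring the dict keys above):

# items.py (hypothetical)
import scrapy

class spider1_item(scrapy.Item):
    header1 = scrapy.Field()
    header2 = scrapy.Field()

class spider2_item(scrapy.Item):
    header_a = scrapy.Field()
    header_b = scrapy.Field()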

Thank you!


Solution

  • Because both pipelines are enabled globally in settings.py, every item from every spider currently runs through both of them, which is where the KeyError: 'header1' on spider2's items comes from. You can fix this by using the custom_settings spider attribute to set ITEM_PIPELINES individually for each spider:

    #spider.py
    class OneSpider(scrapy.Spider):
        name = "spider1"
        custom_settings = {
            'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300},
        }
        ...

    class TwoSpider(scrapy.Spider):
        name = "spider2"
        custom_settings = {
            'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300},
        }
        ...
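
  • With ITEM_PIPELINES set per spider like this, the global ITEM_PIPELINES entry in settings.py is no longer needed. As a minimal sketch (assuming OneSpider and TwoSpider live in Webscraper/spiders/spider.py as in your tree; adjust the import path if yours differs), you can also launch both crawls from a single script with Scrapy's CrawlerProcess instead of chaining shell commands:

    # run_spiders.py -- run from the project root (next to scrapy.cfg)
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from Webscraper.spiders.spider import OneSpider, TwoSpider  # assumed module path

    process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
    process.crawl(OneSpider)   # its custom_settings route items to spider1_pipelines -> spider1.csv
    process.crawl(TwoSpider)   # its custom_settings route items to spider2_pipelines -> spider2.csv
    process.start()            # blocks until both crawls have finished

    Both crawls run in the same process here, so there is no need for the nohup/& chaining.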