Scrapy: How to save a list from our spider class in a file when scraping


I am finding a list of webpage links from a start URL and then finding all links by following into those pages. I am currently saving them as a list in self.links, and I want to know how to save them to a CSV or JSON file after the scraping is done. My goal is to call a new function to process data on each followed page.

import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    name = "myspider"
    links = []
    start_urls = ["https://books.toscrape.com/"]
    # Substring a URL must contain to be followed (assumed value).
    current = "books.toscrape.com"

    # `parse` is called for every page the spider crawls.
    def parse(self, response):
        to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                    'contact', 'java', 'cookies', 'policies', 'google', 'mail']
        le = LinkExtractor(deny=to_avoid)
        ex_links = le.extract_links(response)
        for href in ex_links:
            url = response.urljoin(href.url)
            if self.current in url:
                self.links.append(url)
                yield response.follow(url, callback=self.parse)

I tried adding another parse_landing_page(self, response) function and yielding it from the parse() function, but it didn't work.


Solution

  • Scrapy has this functionality built in as Feed Exports. To use it, yield a dictionary (or Item) from your parse method and then tell Scrapy where to save the output, either on the command line or in your spider's settings.

    For example:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://books.toscrape.com/"]
        # Substring a URL must contain to be followed (assumed value).
        current = "books.toscrape.com"
        custom_settings = {
            "FEEDS": {
                "items.csv": {
                    "format": "csv",
                    "fields": ["link"],
                }
            }
        }

        # `parse` is called for every page the spider crawls.
        def parse(self, response):
            to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                        'contact', 'java', 'cookies', 'policies', 'google', 'mail']
            le = LinkExtractor(deny=to_avoid)
            ex_links = le.extract_links(response)
            for href in ex_links:
                url = response.urljoin(href.url)
                if self.current in url:
                    yield {'link': url}
                    yield response.follow(url, callback=self.parse)

    Or instead of using the custom settings you could just use the -o option on the command line:

    scrapy crawl myspider -o items.csv
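
    Note that in recent Scrapy versions (2.4+) -o appends to an existing file while -O overwrites it, and the export format is inferred from the file extension, so items.json would produce JSON output instead of CSV.

    The question also mentions wanting to call a separate function to process each followed page. Below is a minimal sketch of how that could be combined with the same feed export; the parse_landing_page name comes from the question, while the current attribute, the extra title field, and the CSS selector are illustrative assumptions rather than part of the answer above.

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://books.toscrape.com/"]
        # Substring a URL must contain to be followed (assumed value).
        current = "books.toscrape.com"
        custom_settings = {
            "FEEDS": {
                "items.csv": {
                    "format": "csv",
                    "fields": ["link", "title"],
                }
            }
        }

        def parse(self, response):
            to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                        'contact', 'java', 'cookies', 'policies', 'google', 'mail']
            le = LinkExtractor(deny=to_avoid)
            for link in le.extract_links(response):
                url = response.urljoin(link.url)
                if self.current in url:
                    # Hand each followed page to a separate callback for processing.
                    yield response.follow(url, callback=self.parse_landing_page)

        def parse_landing_page(self, response):
            # Anything yielded here is written to items.csv by the FEEDS setting above.
            yield {"link": response.url, "title": response.css("title::text").get()}

    This sketch only follows links one level deep from the start URL; if deeper crawling is needed, parse_landing_page can extract and follow further links in the same way, and Scrapy's duplicate filter will skip URLs that have already been requested.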