Scrapy and Python parsing


I'm learning Scrapy, using the website http://quotes.toscrape.com as an example. I created a simple spider (scrapy genspider quotes). I want to parse the quotes and also follow the link to each author's page to parse the author's date of birth. I tried it the following way, but it doesn't work.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        
        quotes=response.xpath('//div[@class="quote"]') 
        
        item={}

        for quote in quotes: 
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item) 
            

        new_page=response.xpath('//li[@class="next"]/a/@href').get() 

        if new_page is not None: 

            yield response.follow(new_page,self.parse) 
            
    def parse_additional_page(self, response, item): 
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get() 
        yield item
            

Code without the date of birth (this works correctly):

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes=response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name':quote.xpath('.//span[@class="text"]/text()').get(),
                'author':quote.xpath('.//small[@class="author"]/text()').get(),
                'tags':quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
                }
        new_page=response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page,self.parse)

Question: how do I go to the author's page for each quote and parse the date of birth?


Solution

  • You are very close to having it right. You are missing a couple of things, and one thing needs to be moved.

    1. response.follow returns a Request object. Unless you yield that request from your callback, the Scrapy engine never receives it, so it is never dispatched.

    2. When passing objects from one callback function to another, you should use the cb_kwargs parameter. Using the meta dictionary works too, but the Scrapy docs recommend cb_kwargs for passing user data. Simply passing the object as an extra positional argument, however, will not work.

    3. A dict is mutable, and that includes dicts used as Scrapy items. So when you create items, each one should be its own object. Otherwise, when you update an item later, you might end up mutating items that were already yielded or passed along.
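The three points above can be seen without Scrapy at all. The sketch below is a toy model, not Scrapy's real implementation: ToyRequest, run, and the parse_* functions are hypothetical names made up for this illustration. The toy engine only sees what a callback yields, calls the next callback with the extra data unpacked as keyword arguments, and collects whatever items come out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request: URL, callback, and keyword arguments."""
    url: str
    callback: Callable[..., Any]
    cb_kwargs: dict = field(default_factory=dict)


def run(callback, response, **kwargs):
    """Hypothetical mini engine: follow yielded requests, collect yielded items."""
    items = []
    for result in callback(response, **kwargs):
        if isinstance(result, ToyRequest):
            # Pretend the URL was downloaded, then call the next callback
            # with the extra data unpacked as keyword arguments (point 2).
            fake_page = "page for " + result.url
            items.extend(run(result.callback, fake_page, **result.cb_kwargs))
        else:
            items.append(result)
    return items


def parse_author(response, item):
    item["born"] = "born date from " + response
    yield item


def parse_forgotten(response):
    for name in ("q1", "q2"):
        # Point 1: the request is built but never yielded,
        # so the engine never sees it.
        ToyRequest("/author/" + name, parse_author, {"item": {"name": name}})
    yield from ()  # keeps this function a generator that yields nothing


def parse_shared(response):
    item = {}  # point 3: one dict shared by every loop iteration
    for name in ("q1", "q2"):
        item["name"] = name
        yield ToyRequest("/author/" + name, parse_author, {"item": item})


def parse_fixed(response):
    for name in ("q1", "q2"):
        item = {"name": name}  # a fresh dict for each quote
        yield ToyRequest("/author/" + name, parse_author, {"item": item})


print(run(parse_forgotten, "start"))
print([i["name"] for i in run(parse_shared, "start")])
print([i["name"] for i in run(parse_fixed, "start")])
```

In this toy model, parse_forgotten produces no items, parse_shared returns the same dict twice (so both entries carry the last quote's values), and parse_fixed returns one distinct item per quote.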

    Here is an example that uses your code but applies the three points above.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["http://quotes.toscrape.com/"]
    
        def parse(self, response):
            for quote in response.xpath('//div[@class="quote"]'):
                # moving the item constructor inside the loop 
                # means it will be unique for each item
                item={}   
    
                item['name']=quote.xpath('.//span[@class="text"]/text()').get()
                item['author']=quote.xpath('.//small[@class="author"]/text()').get()
                item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
                url=quote.xpath('.//small[@class="author"]/../a/@href').get()
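                # you have to yield the request returned by response.follow;
                # dont_filter=True keeps Scrapy's duplicate filter from dropping
                # repeat requests for an author who has more than one quote
                yield response.follow(url, self.parse_additional_page, cb_kwargs={"item": item}, dont_filter=True)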
            new_page=response.xpath('//li[@class="next"]/a/@href').get()
            if new_page is not None:
                yield response.follow(new_page)
    
        def parse_additional_page(self, response, item=None):
            item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
            yield item
    

    Partial Output:

    2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
    {'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
    2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
    {'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
    2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
    {'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
    2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
    {'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}
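The additional_data values in the output above are plain strings such as 'January 15, 1929'. If a real date object is wanted, a minimal sketch could look like the following. The helper name parse_born_date is hypothetical, and the %B directive assumes the current locale uses English month names.

```python
from datetime import date, datetime


def parse_born_date(text: str) -> date:
    """Convert a string like 'January 15, 1929' into a datetime.date."""
    # strip() guards against stray whitespace around the extracted text
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


print(parse_born_date("January 15, 1929"))
```

Such a helper could be called inside parse_additional_page before the item is yielded, provided the selector actually returned a string rather than None.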
    

    For more information, see the sections "Passing additional data to callback functions" and "Response.follow" in the Scrapy docs.