
Get data from two different URLs into the same Scrapy Item


I'm new to Scrapy and I've been trying to scrape this website: https://quotes.toscrape.com/

The data I want are:

  • quote;

  • author;

  • date of birth; and

  • place of birth.

To get the first two fields (quote and author), I have to scrape

https://quotes.toscrape.com/

But to get the other two (date of birth and place of birth), I have to go to the "about author" page:

https://quotes.toscrape.com/author/[NAME OF THE AUTHOR]

My items.py code is:


import scrapy

class QuotesItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    date_birth = scrapy.Field()
    local_birth = scrapy.Field()

and my quotespider.py code is:


import scrapy
from ..items import QuotesItem  


class QuotespiderSpider(scrapy.Spider):
    name = "quotespider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        all_items= QuotesItem()

        quotes = response.xpath("//div[@class='row']/div[@class='col-md-8']/div")

        for quote in quotes:
            all_items['quote'] = quote.xpath("./span[@class='text']/text()").get()
            all_items['author'] = quote.xpath("./span[2]/small/text()").get()
            # Here we get the first 2 datas.


            about = quote.xpath("./span[2]/small/following-sibling::a/@href").get()
            url_about = 'https://quotes.toscrape.com' + about  # URL to go to 'about author'.


            yield response.follow(url_about, callback=self.about_autor,
                                  cb_kwargs={'items': all_items})
            
            yield item


    def about_autor(self, response, items):  # Should get the other two datas (date_birth, local_bith)

        item['date_birth '] = response.xpath("/html/body/div/div[2]/p[1]/span[1]/text()").get()
        item['local_bith '] = response.xpath("/html/body/div/div[2]/p[1]/span[2]/text()").get()


        yield item

I have tried using the cb_kwargs parameter, as in the quotespider.py code above, but it didn't work.

This is what I get:

[
{"quote": "quote1", 
"autor": "author1",  
"date_birth": "",
 "local_birth": ""}, # Empty for the first 10 items
...
{"quote":"quote10", 
"autor": "author10", 
"date_birth": "", 
"local_birth": ""}, # 10th element also empty

{"quote":"quote10", 
"autor": "author10", 
"date_birth": "December 16, 1775", 
"local_birth": "in Steventon"},  # 10th element *repeated* with wrong date_birth and local_birth 

{"quote": "quote10", 
"autor": "author10", 
"date_birth": "June 01, 1926", 
"local_birth": "United States"}, # 10th element *repeated* with wrong date_birth and local_birth

No local_birth or date_birth was added to the first 10 quotes (the ones yielded in the parse function), and the last quote is repeated once for each author's date_birth and local_birth values.

What I expected to get is something like:


[{'quote': 'quote1',
'author': 'author1',
'date_birth': 'date_birth1',
'local_birth': 'local_birth1'},

{'quote': 'quote2',
'author': 'author2',
'date_birth': 'date_birth2',
'local_birth': 'local_birth2'},

{'quote': 'quote3',
'author': 'author3',
'date_birth': 'date_birth3',
'local_birth': 'local_birth3'},
]

Solution

  • There are a number of typos in your code that need fixing. In the about_autor method you pass in items, but the variable used in the method body is item. The keys 'date_birth ' and 'local_bith ' also contain a trailing space (and, in the second case, a misspelling), so they don't match the fields declared in QuotesItem. There is also a yield item statement below your yield response.follow call in the parse method; since no variable named item exists there, it would raise a NameError.

    Beyond that, there are a few additional notes I would make.

    • When iterating over a list of selectors, move the item initialization inside the loop. That way each iteration yields a unique item instead of overwriting the values of a single shared item.

    • cb_kwargs stands for callback keyword arguments: each key in the dict is passed to the callback as a keyword argument, so the key must match the name of a parameter of about_autor.

    • Since the quotes site features multiple quotes from the same author, you should add the dont_filter=True parameter to your call to response.follow so that Scrapy doesn't filter out the duplicate requests when an author's page is requested more than once.
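
The repeated last quote in the question follows from the first note: every request carried a reference to the same item object, and the callbacks only ran after the loop had already overwritten it. A minimal plain-Python sketch of that effect is below. It is not Scrapy code; plain dicts stand in for QuotesItem, and the names schedule_shared and schedule_fresh are hypothetical, chosen only for illustration.

```python
# Plain-Python stand-in for the spider. The "pending" list plays the role of
# the requests that are yielded now and handled by a callback later.

def schedule_shared(quotes):
    """Buggy pattern: one item created before the loop and reused."""
    item = {}
    pending = []
    for quote, author in quotes:
        item["quote"] = quote
        item["author"] = author
        pending.append(item)  # every entry refers to the same dict
    return pending


def schedule_fresh(quotes):
    """Fixed pattern: a new item is created on each iteration."""
    pending = []
    for quote, author in quotes:
        item = {"quote": quote, "author": author}
        pending.append(item)  # each entry is its own dict
    return pending


quotes = [("quote1", "author1"), ("quote2", "author2"), ("quote3", "author3")]

shared = schedule_shared(quotes)
fresh = schedule_fresh(quotes)

print([entry["quote"] for entry in shared])  # ['quote3', 'quote3', 'quote3']
print([entry["quote"] for entry in fresh])   # ['quote1', 'quote2', 'quote3']
```

By the time any "callback" looks at the shared version, all entries show the last quote, which matches the symptom described in the question.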

    With these changes, the spider worked fine for me.

    Example:

    import scrapy
    
    class QuotesItem(scrapy.Item):
        quote = scrapy.Field()
        author = scrapy.Field()
        date_birth = scrapy.Field()
        local_birth = scrapy.Field()
    
    class QuotespiderSpider(scrapy.Spider):
        name = "quotespider"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com/"]
    
        def parse(self, response):
            quotes = response.xpath("//div[@class='row']/div[@class='col-md-8']/div")
            for quote in quotes:
                item = QuotesItem()  # a fresh item for every quote
                item['quote'] = quote.xpath("./span[@class='text']/text()").get()
                item['author'] = quote.xpath("./span[2]/small/text()").get()
                # The first two fields (quote and author) come from the listing page.
                about = quote.xpath("./span[2]/small/following-sibling::a/@href").get()
                url_about = 'https://quotes.toscrape.com' + about  # URL to go to 'about author'.
                yield response.follow(url_about, callback=self.about_autor,
                                      cb_kwargs={'item': item}, dont_filter=True)
    
        def about_autor(self, response, item):  # Fills in the other two fields (date_birth, local_birth)
            item['date_birth'] = response.xpath("/html/body/div/div[2]/p[1]/span[1]/text()").get()
            item['local_birth'] = response.xpath("/html/body/div/div[2]/p[1]/span[2]/text()").get()
            yield item
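
To illustrate how the item reaches about_autor, here is a small sketch of the keyword-argument hand-off, assuming the callback is invoked as callback(response, **cb_kwargs). The helper call_callback and the dict used in place of a response are hypothetical stand-ins, not part of Scrapy.

```python
def about_autor(response, item):
    # Same idea as the spider method, but "response" here is only a dict.
    item["date_birth"] = response["born"]
    return item


def call_callback(callback, response, cb_kwargs):
    # Hypothetical helper: unpack the dict into keyword arguments.
    return callback(response, **cb_kwargs)


response = {"born": "March 14, 1879"}  # stand-in for the author page

# The key "item" matches the parameter name, so the call succeeds.
filled = call_callback(about_autor, response,
                       {"item": {"author": "Albert Einstein"}})
print(filled)

# A key that matches no parameter name fails with a TypeError.
try:
    call_callback(about_autor, response,
                  {"items": {"author": "Albert Einstein"}})
except TypeError as exc:
    mismatch_error = str(exc)
    print(mismatch_error)
```

The point is only that the dict keys and the callback's parameter names have to agree; the name itself (item, items, or anything else) is a free choice as long as the method body uses the same one.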
    
    

    OUTPUT

    {'author': 'Steve Martin',
     'date_birth': 'August 14, 1945',
     'local_birth': 'in Waco, Texas, The United States',
     'quote': '“A day without sunshine is like, you know, night.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
    {'author': 'J.K. Rowling',
     'date_birth': 'July 31, 1965',
     'local_birth': 'in Yate, South Gloucestershire, England, The United Kingdom',
     'quote': '“It is our choices, Harry, that show what we truly are, far more '
              'than our abilities.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
    {'author': 'Eleanor Roosevelt',
     'date_birth': 'October 11, 1884',
     'local_birth': 'in The United States',
     'quote': '“A woman is like a tea bag; you never know how strong it is until '
              "it's in hot water.”"}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
    {'author': 'Albert Einstein',
     'date_birth': 'March 14, 1879',
     'local_birth': 'in Ulm, Germany',
     'quote': '“The world as we have created it is a process of our thinking. It '
              'cannot be changed without changing our thinking.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
    {'author': 'Marilyn Monroe',
     'date_birth': 'June 01, 1926',
     'local_birth': 'in The United States',
     'quote': "“Imperfection is beauty, madness is genius and it's better to be "
              'absolutely ridiculous than absolutely boring.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
    {'author': 'Thomas A. Edison',
     'date_birth': 'February 11, 1847',
     'local_birth': 'in Milan, Ohio, The United States',
     'quote': "“I have not failed. I've just found 10,000 ways that won't work.”"}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
    {'author': 'Albert Einstein',
     'date_birth': 'March 14, 1879',
     'local_birth': 'in Ulm, Germany',
     'quote': '“Try not to become a man of success. Rather become a man of value.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
    {'author': 'Albert Einstein',
     'date_birth': 'March 14, 1879',
     'local_birth': 'in Ulm, Germany',
     'quote': '“There are only two ways to live your life. One is as though '
              'nothing is a miracle. The other is as though everything is a '
              'miracle.”'}
    2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
    {'author': 'André Gide',
     'date_birth': 'November 22, 1869',
     'local_birth': 'in Paris, France',
     'quote': '“It is better to be hated for what you are than to be loved for '
              'what you are not.”'}
    2023-11-22 15:56:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
    {'author': 'Jane Austen',
     'date_birth': 'December 16, 1775',
     'local_birth': 'in Steventon Rectory, Hampshire, The United Kingdom',
     'quote': '“The person, be it gentleman or lady, who has not pleasure in a '
              'good novel, must be intolerably stupid.”'}
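
The output above contains three Albert Einstein quotes, which is where dont_filter=True matters. The toy model below sketches why: it assumes a duplicate filter that remembers which URLs it has already seen and drops repeats. Scrapy's real filter works on request fingerprints rather than raw URLs, and the function crawl and the placeholder quote labels are hypothetical.

```python
def crawl(requests, dont_filter):
    """Return the quotes whose author-page request would be sent."""
    seen = set()
    delivered = []
    for quote, url in requests:
        if not dont_filter and url in seen:
            continue  # duplicate request dropped, callback never runs
        seen.add(url)
        delivered.append(quote)
    return delivered


requests = [
    ("Einstein quote 1", "/author/Albert-Einstein/"),
    ("Rowling quote", "/author/J-K-Rowling/"),
    ("Einstein quote 2", "/author/Albert-Einstein/"),
    ("Einstein quote 3", "/author/Albert-Einstein/"),
]

filtered = crawl(requests, dont_filter=False)
unfiltered = crawl(requests, dont_filter=True)

print(filtered)    # ['Einstein quote 1', 'Rowling quote']
print(unfiltered)  # all four quotes
```

Because each item is only yielded from the author-page callback, a dropped request means a lost quote. With filtering disabled the author page is fetched once per quote, which costs extra requests but keeps every item.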