I'm learning Scrapy. For example, there is a website http://quotes.toscrape.com . I'm creating a simple spider (scrapy genspider quotes). I want to parse quotes, as well as go to the author's page and parse his date of birth. I'm trying to do it this way, but nothing works.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        item = {}
        for quote in quotes:
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item)
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
The code without the date of birth works correctly:
import scrapy

class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            }
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)
Question: how to go to the author's page for each quote and parse the date of birth?
You are actually really close to having it right. There are just a couple of things you are missing and one thing that needs to be moved.
1. response.follow returns a request object, so unless you yield that request object it will never be dispatched by the Scrapy engine.
2. When passing objects from one callback function to another, you should use the cb_kwargs parameter. Using the meta dictionary works too, but Scrapy officially prefers cb_kwargs. Simply passing the object as a positional argument, however, will not work.
3. A dict is mutable, and that includes dicts used as Scrapy items. Each item you create should therefore be its own object. Otherwise, when you update the item later, you might end up mutating previously yielded items.
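The point about mutability can be seen without Scrapy at all. The following is a minimal sketch in plain Python; the function names and the sample rows are hypothetical and stand in for the quotes on a page. It compares creating the dict once outside the loop with creating a new dict on every iteration.

```python
# Hypothetical stand-in data; no Scrapy required for this illustration.
quotes = [("Quote A", "Author A"), ("Quote B", "Author B")]

def shared_item(rows):
    item = {}                        # created once, outside the loop
    collected = []
    for text, author in rows:
        item["name"] = text          # overwrites the previous values
        item["author"] = author
        collected.append(item)       # every entry is the same dict object
    return collected

def fresh_items(rows):
    collected = []
    for text, author in rows:
        item = {"name": text, "author": author}  # new dict each iteration
        collected.append(item)
    return collected

shared = shared_item(quotes)
fresh = fresh_items(quotes)
print([i["name"] for i in shared])   # ['Quote B', 'Quote B']
print([i["name"] for i in fresh])    # ['Quote A', 'Quote B']
```

In the shared version both list entries refer to one dict, so the last quote overwrites the earlier one. The same thing can happen in a spider when a single dict is reused across several pending requests.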
Here is an example that uses your code but implements the three points I made above.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item = {}
            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow;
            # dont_filter=True keeps Scrapy's duplicate filter from dropping
            # repeat visits to the same author page, so every quote is yielded
            yield response.follow(url, self.parse_additional_page,
                                  cb_kwargs={"item": item}, dont_filter=True)
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            # with no callback given, the response is handled by parse
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
Partial Output:
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}
Check out "Passing additional data to callback functions" and "Response.follow" in the Scrapy docs for more information.
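To see why the missing yield matters, here is a toy sketch in plain Python. It is not how Scrapy is implemented: FakeRequest, follow, and run are hypothetical stand-ins, and the only assumption carried over is that a callback is a generator and the engine can only act on what that generator yields.

```python
class FakeRequest:
    """Hypothetical stand-in for a request object; it only records a URL."""
    def __init__(self, url):
        self.url = url

def follow(url):
    # Like response.follow, this only builds and returns an object.
    return FakeRequest(url)

def callback_without_yield():
    follow("/author/a")          # built, then discarded: nobody receives it
    yield follow("/page/2")

def callback_with_yield():
    yield follow("/author/a")    # handed to the consumer of the generator
    yield follow("/page/2")

def run(callback):
    """Toy 'engine': collects the URLs of whatever the callback yields."""
    return [request.url for request in callback()]

print(run(callback_without_yield))  # ['/page/2']
print(run(callback_with_yield))     # ['/author/a', '/page/2']
```

The first callback mirrors the original spider: the author request is created but never leaves the function, so only the pagination request is seen.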