Why does this scraping code only extract the very first title, author, and quote from each page, resulting in a three-row CSV file?
I am using the Scrapy web crawler and loading the data into a CSV file. I'm using XPath and have run into an issue loading my data correctly. This is my first time using Python, and I'm struggling to use the enumerate/zip functions properly.
import scrapy

class MySpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'test.csv'
    }
    start_urls = [
        'http://quotes.toscrape.com/',
        'http://quotes.toscrape.com/page/2/',
        'http://quotes.toscrape.com/page/3/'
    ]

    def parse(self, response):
        titles = response.xpath("//div[contains(@class, 'col-md-4')]/h2/text()").extract()
        authors = response.xpath("//small[contains(@class, 'author')]/text()").extract()
        quotes = response.xpath("//div[contains(@class, 'quote')]/span[contains(@class, 'text')]/text()").extract()
        for i, (title, author, quote) in enumerate(zip(titles, authors, quotes)):
            yield {'index': i, 'title': title, 'author': author, 'quote': quote}
The problem here is that `zip` only yields as many tuples as the shortest iterable passed to it. In this case `titles` contains just one element per page (the single `col-md-4` heading), so it is correct that the `for` loop iterates only once per page.
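You can see this truncation behavior in isolation; a minimal sketch independent of Scrapy, with made-up placeholder data:

```python
# zip stops as soon as the shortest iterable is exhausted
titles = ['Top Ten tags']                          # only one match per page
authors = ['Einstein', 'Rowling', 'Austen']        # three matches
quotes = ['quote 1', 'quote 2', 'quote 3']         # three matches

rows = list(zip(titles, authors, quotes))
print(rows)       # [('Top Ten tags', 'Einstein', 'quote 1')]
print(len(rows))  # 1 -- the two extra authors/quotes are silently dropped
```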
If you want that same title repeated for every row, extract it once with `extract_first()` and iterate only over `authors` and `quotes`:
title = response.xpath("//div[contains(@class, 'col-md-4')]/h2/text()").extract_first()
authors = response.xpath("//small[contains(@class, 'author')]/text()").extract()
quotes = response.xpath("//div[contains(@class, 'quote')]/span[contains(@class, 'text')]/text()").extract()
for i, (author, quote) in enumerate(zip(authors, quotes)):
    yield {'index': i, 'title': title, 'author': author, 'quote': quote}