Tags: python, csv, for-loop, scrapy, enumerate

Python - for loop which yields scraped data only looping once per page


Why does this scraping code extract only the very first title, author, and quote from each page, resulting in a CSV file with only three rows?

I am using the Scrapy web crawler and writing the data to a CSV file. I'm using XPath and have run into an issue loading my data correctly. This is my first time using Python, and I'm struggling to use the enumerate/zip functions properly.

import scrapy

class MySpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'test.csv'
    }
    start_urls = [
        'http://quotes.toscrape.com/',
        'http://quotes.toscrape.com/page/2/',
        'http://quotes.toscrape.com/page/3/'
    ]

    def parse(self, response):
        titles = response.xpath("//div[contains(@class, 'col-md-4')]/h2/text()").extract()
        authors = response.xpath("//small[contains(@class, 'author')]/text()").extract()
        quotes = response.xpath("//div[contains(@class, 'quote')]/span[contains(@class, 'text')]/text()").extract()
        for i, (title, author, quote) in enumerate(zip(titles, authors, quotes)):
            yield {'index': i, 'title': title, 'author': author, 'quote': quote}

Solution

  • The problem here is that zip only produces as many tuples as the shortest of the iterables passed to it. In this case titles contains only one element per page, so it is correct that the for loop iterates only once; a small plain-Python demonstration of this follows the solution code below.

    If you want that same title repeated for every row, extract it once with extract_first() and iterate over only authors and quotes:

    # Extract the page-level title once; extract_first() returns a single string
    title = response.xpath("//div[contains(@class, 'col-md-4')]/h2/text()").extract_first()
    authors = response.xpath("//small[contains(@class, 'author')]/text()").extract()
    quotes = response.xpath("//div[contains(@class, 'quote')]/span[contains(@class, 'text')]/text()").extract()
    # authors and quotes have the same length, so zip now yields one tuple per quote on the page
    for i, (author, quote) in enumerate(zip(authors, quotes)):
        yield {'index': i, 'title': title, 'author': author, 'quote': quote}
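
    To see why the original loop stopped after one row per page, here is a minimal plain-Python sketch of zip's behaviour; the list contents are made-up placeholders, not real scraped data:

    # titles holds a single page-level heading, while authors and quotes
    # hold one entry per quote on the page
    titles = ['page heading']
    authors = ['author 1', 'author 2', 'author 3']
    quotes = ['quote 1', 'quote 2', 'quote 3']

    # zip stops at the shortest input, so only one combined tuple is produced
    print(list(zip(titles, authors, quotes)))
    # [('page heading', 'author 1', 'quote 1')]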