Tags: python, web-scraping, scrapy

Why is Scrapy CrawlSpider returning 'None' on this website: 'https://books.toscrape.com/'?


Below is the code I am using to extract three values (UPC, Price & Availability) from this website: https://books.toscrape.com/. I am using a Scrapy CrawlSpider, but it returns 'None' for all three extracted values. What I am trying to achieve is this: go into every book on the 1st page and extract the three values mentioned above. Code is below:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True))

    def parse_item(self, response):

        product_info = response.xpath('//table[@class="table table-striped"]')

        upc = product_info.xpath('(./tbody/tr/td)[1]/text()').get()
        price = product_info.xpath('(./tbody/tr/td)[3]/text()').get()
        availability = product_info.xpath('(./tbody/tr/td)[6]/text()').get()

        yield {'UPC': upc, 'Price': price, 'Availability': availability}
        # print(response.url)

Solution

  • Your program actually raises `TypeError: 'Rule' object is not iterable` rather than returning None: `rules` must be an iterable of `Rule` objects, but `(Rule(...))` without a trailing comma is not a tuple, it is just a parenthesised expression. See this answer: https://stackoverflow.com/a/53343029/18857676

    My modified and optimized code:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    class BooksSpider(CrawlSpider):
        name = "bookscraper"
        allowed_domains = ["books.toscrape.com"]
        start_urls = ["https://books.toscrape.com/"]
    
        rules = (
            Rule(LinkExtractor(restrict_xpaths="//h3/a"), callback="parse_item", follow=True),
        )
    
        def parse_item(self, response):
            # Select every <td> of the product table directly, without going
            # through <tbody>: the raw HTML has no <tbody> element (browsers
            # insert it in dev tools), which is why the original XPath matched
            # nothing and .get() returned None.
            product_info = response.xpath('//table[@class="table table-striped"]//td/text()')
            temp = [value.get().strip() for value in product_info]
            # Rows in order: UPC, Product Type, Price (excl. tax),
            # Price (incl. tax), Tax, Availability, Number of reviews
            upc = temp[0]
            price = temp[2]
            availability = temp[5]
            return {'UPC': upc, 'Price': price, 'Availability': availability}
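
    To see why the trailing comma matters, here is a minimal, Scrapy-free sketch; the `Rule` class below is just a stand-in for `scrapy.spiders.Rule`, not the real thing:

    ```python
    class Rule:  # stand-in for scrapy.spiders.Rule, used only for illustration
        pass

    rule = Rule()

    not_a_tuple = (rule)     # parentheses alone: still just the Rule object
    assert not_a_tuple is rule

    one_tuple = (rule,)      # the trailing comma is what makes the tuple
    assert isinstance(one_tuple, tuple) and len(one_tuple) == 1

    # CrawlSpider iterates over `rules` to compile them; a bare Rule is not
    # iterable, which produces the TypeError mentioned above:
    try:
        list(not_a_tuple)
    except TypeError as e:
        print(e)  # 'Rule' object is not iterable

    assert list(one_tuple) == [rule]
    ```

    The same applies to any one-element `rules` definition: writing it as `rules = (Rule(...),)` or as a list, `rules = [Rule(...)]`, both avoid the problem.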