python · web-scraping · scrapy

AttributeError during recursive scraping with Scrapy


I have a Scrapy spider that works well as long as I give it a page containing the links to the pages it should scrape. Now, instead of giving it each category page, I want to give it the page that contains links to all the categories. I thought I could simply add another parse function to achieve this.

However, the console output gives me an attribute error:

"AttributeError: 'zaubersonder' object has no attribute 'parsedetails'"

This tells me that some attribute reference is not working correctly. I am new to object orientation, but I thought Scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and this should work fine.
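A callback like self.parse_details is just an attribute lookup on the spider instance, so a misspelled method name fails with exactly this AttributeError the moment Scrapy tries to resolve it. A minimal sketch of the mechanism (plain Python, no Scrapy required; the class name Demo is made up for illustration):

```python
class Demo:
    """Stand-in for a spider: one correctly named method."""

    def parse_details(self, response):
        return "parsed"


d = Demo()

# Looking up an existing method by name returns a bound method,
# which can then be called like any other function.
cb = d.parse_details
print(cb(None))  # prints "parsed"

# A misspelled name raises AttributeError at lookup time --
# the same error the spider reports for 'parsedetails'.
try:
    d.parsedetails
except AttributeError as err:
    print(type(err).__name__)  # prints "AttributeError"
```

So the error has nothing to do with the request flow itself; it fires as soon as the misspelled attribute is referenced.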

Below is my effort so far.

import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        # Links to the category pages
        urls = response.css('a.ulSubMenu::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        # Links to the individual entries
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):
        # Extract the entries themselves
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract()
                       + response.css('div.ce_text.last.block').extract(),
        }

Edit: fixed the code above, in case someone searches for it later.


Solution

  • There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the function is named parse_details.

    Just change the yield in parse_level2 to:

    yield scrapy.Request(url=url2, callback=self.parse_details)


    ...and it should work.