I have a Scrapy spider that works well as long as I give it a page containing the links to the pages it should scrape. Now, instead of giving it every category page, I want to give it the single page that links to all categories. I thought I could simply add another parse function to achieve this, but the console output gives me an attribute error:
"AttributeError: 'zaubersonder' object has no attribute 'parsedetails'"
This tells me that some attribute reference is not working correctly. I am new to object orientation, but I thought Scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and that this should work fine.
Below is my effort so far.
import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to categories
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()  # links to entries
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):  # extract entries
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract() + response.css('div.ce_text.last.block').extract(),
        }
Edit: I fixed the code above in case someone searches for it.
There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the function is named parse_details.
Just change the yield in parse_level2 to:
    yield scrapy.Request(url=url2,callback=self.parse_details)
...and it should work better.
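If it helps to see why the error only appears while the spider is running, here is a minimal sketch in plain Python that does not use Scrapy at all. The class MiniSpider and the method names parse_level2_typo and parse_level2_fixed are hypothetical stand-ins made up for this illustration. The point is that self.parsedetails is an ordinary attribute lookup, and because the method is a generator, that lookup only runs once something iterates over it.

```python
class MiniSpider:
    """Hypothetical stand-in for a spider; it does not use Scrapy."""

    def parse_level2_typo(self, urls):
        for url in urls:
            # The misspelled attribute is looked up when this line runs,
            # which is the moment the AttributeError is raised.
            yield (url, self.parsedetails)

    def parse_level2_fixed(self, urls):
        for url in urls:
            # Correct name: the lookup returns a bound method.
            yield (url, self.parse_details)

    def parse_details(self, response):
        return {"Titel": response}


spider = MiniSpider()
urls = ["page1.html"]

# Creating the generator runs none of its body, so nothing fails yet.
broken = spider.parse_level2_typo(urls)

try:
    next(broken)  # the first iteration triggers the attribute lookup
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# With the correct name, the callback can be handed over and called later.
url, callback = next(spider.parse_level2_fixed(urls))
result = callback(url)

print(error_message)
print(result)
```

The same thing happens in the real spider: the misspelled name is only noticed once parse_level2 is iterated for a response, so the spider starts normally and fails partway through the crawl.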