I need to scrape every page under every category. Currently I'm able to go into a category of listings and scrape every page that follows via the "next page" link. What I want is to go into one category, scrape every page in that category, and once that is done move on to the next category and do the same thing. Some categories also have other categories nested inside them.
For example: https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_unv_b_1_173508_2 (this is the books list). There are categories on the left (Arts & Photography, Audible Audiobooks, ...). Under each category there are more categories: under Arts & Photography there are Architecture, Business of Art, ...; under Architecture there are Buildings, Criticism, ...; under Buildings there are Landmarks & Monuments, Religious Buildings, ...; and once you get to Landmarks & Monuments, that's a leaf node (no further subcategories) with 100 pages of listings. So what I want to do is go into Arts & Photography, keep going down into every subcategory until I hit a leaf node, scrape the listings on every page there, then do the same for its sibling nodes; once every sibling is finished, backtrack and go into Religious Buildings, finish that, backtrack, go to the next category under Buildings, finish every category under Buildings, backtrack, go into Criticism... and so on. In short: scrape every book under every subcategory listed on Amazon.
Right now I have the logic below to crawl every page of a category given in start_urls. (Note: I can't really list every category in start_urls since there are so many of them.) The code works and scrapes every page listed under the one category given in the start URL. What I need is an idea for how to make it automatically jump to the next subcategory, do the same thing, come back once it finishes, go to the next subcategory... and so on.
name = "my_crawler"
allowed_domains = ["somewebsite.com"]
start_urls = [
"someurl.....",
]
rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="pagnNext"]',)), callback="parse_page", follow= True),)
def parse_page(self, response):
asds = Selector(response).xpath('//span[contains(@class,"price")]/text()').extract()
for asd in asds:
item['fsd'] = asd.xpath('@title').extract()[0]
yield item
Can anyone help?? Thanks
The easy way would be to provide the URL of each category you want to scrape and list them all in start_urls:
start_urls = ['http://url_category1.html', 'http://url_category2.html', 'http://url_category3.html']
That is one way. Alternatively, you can make your own requests by following the href of each category link:
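Here is a rough sketch of that second approach, using a plain Spider that keeps descending into subcategory links until it reaches a leaf category and then paginates through its listings. The XPaths (the zg_browseRoot navigation tree, the pagnNext link) and the price field are assumptions based on your snippet and the page you linked, so adjust them to the real markup. Also note that Scrapy schedules requests asynchronously, so the crawl won't be strictly depth-first, but every category and every page will still be visited once (duplicate URLs are filtered out by default).

import scrapy


class BooksCrawler(scrapy.Spider):
    name = "books_crawler"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/",
    ]

    def parse(self, response):
        # Assumed XPath: subcategory links in the left-hand category tree.
        subcategories = response.xpath(
            '//ul[@id="zg_browseRoot"]//ul//a/@href').extract()

        if subcategories:
            # Not a leaf yet: descend into every subcategory.
            for href in subcategories:
                yield scrapy.Request(response.urljoin(href), callback=self.parse)
        else:
            # Leaf category: scrape the listings on this page.
            for item in self.parse_page(response):
                yield item

    def parse_page(self, response):
        # Same item extraction as in your current spider.
        for price in response.xpath(
                '//span[contains(@class, "price")]/text()').extract():
            yield {"price": price.strip()}

        # Keep following the "next page" link inside the leaf category.
        next_page = response.xpath(
            '//a[@class="pagnNext"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse_page)

Depending on the page, the left-hand tree may also show sibling categories when you are already on a leaf, so you might need a stricter leaf test (for example, only treating links nested below the currently highlighted category as subcategories).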
Regards