python · scrapy · scrapy-splash

Scrape monster.com using scrapy framework


How can I create a crawler for monster.com that crawls all the pages? For the "next page" link, monster.com calls a JavaScript function, but Scrapy does not execute JavaScript, as you can see in the image.

Here is my code; the pagination is not working:

import scrapy


class MonsterComSpider(scrapy.Spider):
    name = 'monster.com'
    allowed_domains = ['www.monsterindia.com']
    start_urls = ['http://www.monsterindia.com/data-analyst-jobs.html/']

    def parse(self, response):
        urls = response.css('h2.seotitle > a::attr(href)').extract()

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_details)

        # crawling all the pages
        next_page_url = response.css('ul.pager > li > a::attr(althref)').extract()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'name': response.css('h3 > a > span::text').extract()
        }

Solution

  • Your code throws an exception because next_page_url is a list, while the response.urljoin method needs a string. The next-page link extraction should read like this:

    next_page_url = response.css('ul.pager > li > a::attr(althref)').extract_first()
    

    (i.e. replaced extract() with extract_first())
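To see the difference concretely: extract() returns a list of strings, and passing a list to urljoin (which Scrapy's response.urljoin delegates to urllib.parse.urljoin) raises a TypeError. A minimal standard-library sketch; the relative URL here is made up for illustration:

```python
from urllib.parse import urljoin

base = 'http://www.monsterindia.com/data-analyst-jobs.html'

# extract() yields a list of href strings; extract_first() yields just the first one
hrefs = ['data-analyst-jobs-2.html']  # hypothetical next-page href

try:
    urljoin(base, hrefs)  # passing the whole list, as the buggy code effectively does
except TypeError as exc:
    print('list rejected:', exc)

# A single string joins as expected
print(urljoin(base, hrefs[0]))
```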

    EDIT:

    There is another problem with the next_page_url extraction. The logic is correct and pagination works, but the next-page link is right only on the first page: the selector takes the first a in the pager, and from the second page onward that first link is the "Previous" link, so the spider navigates backwards. Modify the next-page URL extraction to this:

    next_page_url = response.css('ul.pager').xpath('.//a[contains(text(), "Next")]/@althref').extract_first()

    Now it correctly paginates through all the pages. (Note the leading .// in the XPath: when chaining off a CSS selector, a bare //a would search the whole document rather than just the ul.pager element.)
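Why the text-based selector is more robust can be shown with a small standalone sketch. The pager markup below is an assumption modeled on the question's selectors (the althref attribute and ul.pager element), not the real site, and the standard library's ElementTree stands in for Scrapy's selector:

```python
import xml.etree.ElementTree as ET

# Simulated pager as it would appear from page 2 onward (markup is an
# assumption based on the question's selectors, not the real site)
pager = ET.fromstring("""
<ul class="pager">
  <li><a althref="http://example.com/page-1.html">Previous</a></li>
  <li><a althref="http://example.com/page-3.html">Next</a></li>
</ul>
""")

links = pager.findall('.//a')

# Naive approach: the first <a> in the pager is "Previous", so the
# spider would bounce back to the page it just left
print(links[0].text, '->', links[0].get('althref'))

# Robust approach: pick the link whose text contains "Next"
next_link = next(a for a in links if 'Next' in (a.text or ''))
print(next_link.text, '->', next_link.get('althref'))
```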