How can I create a crawler for monster.com that crawls all the pages? For the "next page" link, monster.com calls a JavaScript function, but Scrapy does not execute JavaScript.
Here is my code; pagination is not working:
import scrapy

class MonsterComSpider(scrapy.Spider):
    name = 'monster.com'
    allowed_domains = ['www.monsterindia.com']
    start_urls = ['http://www.monsterindia.com/data-analyst-jobs.html/']

    def parse(self, response):
        urls = response.css('h2.seotitle > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_details)
        # crawling all the pages
        next_page_url = response.css('ul.pager > li > a::attr(althref)').extract()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'name': response.css('h3 > a > span::text').extract()
        }
Your code throws an exception because next_page_url is a list, and the response.urljoin method needs a string. The next page link extraction should read like this:

next_page_url = response.css('ul.pager > li > a::attr(althref)').extract_first()

(i.e. extract() replaced with extract_first())
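The difference matters because Scrapy's response.urljoin() is a thin wrapper over the standard library's urllib.parse.urljoin(), which expects a string. A minimal sketch of the failure mode, using urllib.parse directly (the relative href here is a made-up stand-in for the live althref value):

```python
from urllib.parse import urljoin

base = "http://www.monsterindia.com/data-analyst-jobs.html"

# .extract() always returns a list, even for a single match ...
next_page_list = ["data-analyst-jobs-2.html"]      # hypothetical althref value
# ... while .extract_first() returns the first match as a plain string (or None)
next_page_str = next_page_list[0] if next_page_list else None

# A string joins fine:
print(urljoin(base, next_page_str))
# -> http://www.monsterindia.com/data-analyst-jobs-2.html

# A list blows up, which is the exception the original spider hits:
try:
    urljoin(base, next_page_list)
except Exception as exc:
    print("urljoin rejected the list:", type(exc).__name__)
```
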
EDIT:
There is another problem with the next_page_url extraction. The logic is correct and pagination works, but the next page link is only picked up on the first page: extract_first() takes the first a element, and from the second page onward the pager also contains a "Previous" link, which comes first. Modify the next page URL extraction to this:

next_page_url = response.css('ul.pager').xpath('.//a[contains(text(), "Next")]/@althref').extract_first()

(note the relative .//a, which keeps the XPath scoped to the pager instead of searching the whole document). Now it correctly paginates through all the pages.
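The effect of the text filter can be sketched without Scrapy: from page 2 onward the pager holds both links, and a bare "first match" grabs whichever comes first. Filtering on the anchor text picks the right one (the link texts and hrefs below are hypothetical stand-ins for the live page):

```python
# Hypothetical (text, althref) pairs as they appear in ul.pager on page 2:
pager_links = [
    ("Previous", "data-analyst-jobs.html"),
    ("Next", "data-analyst-jobs-3.html"),
]

# Without a filter, "first match" is the Previous link:
first_href = pager_links[0][1]

# The contains(text(), "Next") predicate amounts to filtering on the link text:
next_href = next((href for text, href in pager_links if "Next" in text), None)

print(first_href)  # data-analyst-jobs.html   (wrong page)
print(next_href)   # data-analyst-jobs-3.html (the page we want)
```
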