python-2.7 scrapy ascii non-ascii-characters

How can I paginate the web pages of the following kind?


I'm trying to paginate through the pages of this site (http://www.geny-interim.com/offres/). I used a CSS selector to move from one page to the next with this code:

next_page_url = response.css('a.page:nth-child(4)::attr(href)').extract_first()
if next_page_url:
    yield scrapy.Request(next_page_url)

But this only follows the pagination for two pages, after which the CSS selector stops working as expected. I also tried this:

response.xpath('//*[contains(text(), "›")]/@href/text()').extract_first()

but this also raises a ValueError. Any help would be upvoted.


Solution

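  • There's a problem with this XPath expression:

    //*[contains(text(), "›")]/@href/text()

    An attribute node such as @href has no child nodes, so appending /text() to it selects nothing. Selecting @href on its own already returns the attribute's value.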

    Here's a working spider you can adapt to your needs:

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class GenyInterimSpider(scrapy.Spider):
        name = 'geny-interim'
        start_urls = ['http://www.geny-interim.com/offres/']
    
        def parse(self, response):
            for offer in response.xpath('//div[contains(@class,"featured-box")]'):
                yield {
                    'title': offer.xpath('.//h3/a/text()').extract_first()
                }
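            # Find the "next" link by its label (›) rather than by its position in the pager
            next_page_url = response.xpath('//a[@class="page" and contains(.,"›")]/@href').extract_first()
            if next_page_url:
                # urljoin resolves a relative href against the current page URL
                yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)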
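
The spider above wraps the extracted href in response.urljoin() before requesting it. As a rough illustration of why that helps, the sketch below uses the standard library's urljoin, which resolves a link against a base URL in the same general way. The href values are hypothetical, since the site's real pagination links are not shown here, and resolve() is just a helper name made up for this example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# URL of the page the link was found on (taken from the spider's start_urls)
PAGE_URL = 'http://www.geny-interim.com/offres/'


def resolve(href, page_url=PAGE_URL):
    """Turn an href as found in the HTML into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical href values, for illustration only
relative = resolve('page/2/')
root_relative = resolve('/offres/page/3/')
absolute = resolve('http://www.geny-interim.com/offres/page/4/')

print(relative)
print(root_relative)
print(absolute)
```

All three forms end up as absolute URLs, so the request is valid whichever style of link the site happens to use.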