Scrapy crawler not processing XHR Request

My spider only crawls the first 10 pages, so I assume the Request for the "load more" button is never being followed.

I am scraping this website:

My spider code:

import scrapy
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem

class T3Spider(scrapy.Spider):
    name = "t3" #spider name to call in terminal
    allowed_domains = [''] #the domain where the spider is allowed to crawl
    start_urls = [''] #url from which the spider will start crawling

    def parse(self, response):
        sel = Selector(response)
        review_links = sel.xpath('//div[@id="content"]//div/div/a/@href').extract()
        for link in review_links:
            yield Request(url=""+link, callback=self.parse_review)
        # if there is a load-more button:
        if sel.xpath('//*[@class="load-more"]'):
            req = Request(url=r'http://www\.t3\.com/more/reviews/latest/\d+', headers = {"Referer": "", "X-Requested-With": "XMLHttpRequest"}, callback=self.parse)
            yield req

    def parse_review(self, response):
        pass #all my scraped item fields

What am I doing wrong? Sorry, but I am quite new to Scrapy. Thanks for your time, patience and help.


  • If you inspect the "Load More" button, you will not find any indication of how the link that loads more reviews is constructed. The idea behind it is rather simple: the number at the end of the XHR URL looks suspiciously like the timestamp of the last loaded article. Here is how you can get it:

    import calendar

    from dateutil.parser import parse
    import scrapy
    from scrapy.http import Request


    class T3Spider(scrapy.Spider):
        name = "t3"
        allowed_domains = ['']
        start_urls = ['']

        def parse(self, response):
            reviews = response.css('div.listingResult')
            for review in reviews:
                link = review.xpath("a/@href").extract()[0]
                yield Request(url="" + link, callback=self.parse_review)

            # TODO: handle exceptions here
            # extract the date of the last review on the page
            time = reviews[-1].xpath(".//time/@datetime").extract()[0]
            # convert the date into a Unix timestamp
            timestamp = calendar.timegm(parse(time).timetuple())
            url = '' % timestamp
            req = Request(url=url,
                          headers={"Referer": "", "X-Requested-With": "XMLHttpRequest"},
                          callback=self.parse)
            yield req

        def parse_review(self, response):
            print(response.url)


    • this requires the dateutil module (the python-dateutil package) to be installed
    • you should recheck the code and make sure you are getting all of the reviews without skipping any of them
    • you need a stop condition for the "Load more" requests, for example when a response contains no more reviews
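
One possible way to handle the last two notes is sketched below. It is only an illustration under stated assumptions: the helper names, the example.com URL template, and the ISO 8601 format of the datetime attribute are all hypothetical, and it uses the standard library's datetime.fromisoformat instead of dateutil. The helper returns the next "Load more" URL, or None when the page had no reviews or the same timestamp comes back twice.

```python
import calendar
from datetime import datetime, timezone

# Hypothetical placeholder, not the real endpoint
MORE_URL_TEMPLATE = "http://example.com/more/reviews/latest/%d"


def to_timestamp(datetime_attr):
    """Convert an ISO 8601 string (assumed format) to a UTC Unix timestamp."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11
    parsed = datetime.fromisoformat(datetime_attr.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return calendar.timegm(parsed.utctimetuple())


def next_more_url(datetime_attrs, seen, template=MORE_URL_TEMPLATE):
    """Return the next "Load more" URL, or None when crawling should stop."""
    if not datetime_attrs:
        return None  # empty response: nothing left to load
    timestamp = to_timestamp(datetime_attrs[-1])
    if timestamp in seen:
        return None  # same timestamp again: stop instead of looping forever
    seen.add(timestamp)
    return template % timestamp
```

In the spider, parse() could pass the extracted datetime values and a set stored on the spider to this helper, and yield the XHR Request only when the result is not None.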