Tags: javascript, python, href, scrapy

Empty list for hrefs to achieve pagination through JavaScript onclick functions


My intention is to follow pagination that is driven by JavaScript functions. For example, take the URL http://events.justdial.com/events/index.php?city=Hyderabad. The pagination links appear at the bottom of that page, and in the HTML they are produced by JavaScript functions with href attributes set to #. I am trying to collect those anchor tags even though their href is #. The following is my code:

class justdialdotcomSpider(BaseSpider):
   name = "justdialdotcom"
   allowed_domains = ["www.justdial.com"]
   start_urls = ["http://events.justdial.com/events/index.php?city=Hyderabad"]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       pagination = hxs.select('//div[@id="main"]/div[@id="content"]/div[@id="pagination"]/a').extract()
       print pagination,">>>>>>>>>>>>>>>>>."

When I run the above code, the result is [], an empty list. Can anyone tell me how to paginate through these JavaScript onclick functions, and why the result is empty? I also noticed something weird in the HTML: one of the pagination links has the anchor tag <a onclick="jdevents.setPageNo(2)" href="#">2</a>, but when I open the page with the browser's "view page source" option, I can't find any jdevents.setPageNo(2) call. (I expected that if we could see what that function does, we could send the same data as formdata in a request.) I am really confused and unable to get past this.


Solution

  • If you track the requests the page makes when you click a pagination link, you will find POST requests to the following URL: http://events.justdial.com/events/search.php

    POST data:

    city:Hyderabad 
    cat:0 
    area:0 
    fromDate: 
    toDate: 
    subCat:0 
    pageNo:2
    fetch:events
    

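    The response is in JSON format. The pagination links are most likely built by JavaScript after the page loads, so they are not in the HTML that Scrapy downloads, which would explain the empty list.

    So, your code should look like the following: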

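    import json

    # Import paths for the older Scrapy API used in the question;
    # newer Scrapy releases provide scrapy.Spider in place of BaseSpider.
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest

    class justdialdotcomSpider(BaseSpider):
        name = "justdialdotcom"
        # Must cover events.justdial.com, or follow-up requests are filtered as offsite
        allowed_domains = ["justdial.com"]
        start_urls = ["http://events.justdial.com/events/search.php"]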
    
    
        # Initial request
        def parse(self, response):
            return [FormRequest(url="http://events.justdial.com/events/search.php",
                                            formdata={'fetch': 'area',
                                                      'pageNo': '1',
                                                      'city' : 'Hyderabad',
                                                      'cat' : '0',
                                                      'area' : '0',
                                                      'fromDate': '',
                                                      'toDate' : '',
                                                      'subCat' : '0'
                                                      },
                                            callback=self.area_count
                                            )]
    
    
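        # Get total count and paginate through events
        def area_count(self, response):
            total_count = 0
            for area in json.loads(response.body):
                total_count += int(area["count"])

            # Ceiling division (10 events per page), so an exact multiple
            # of 10 does not request an extra empty page
            pages_count = (total_count + 9) // 10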
    
            page = 1
            while (page <= pages_count):
                yield FormRequest(url="http://events.justdial.com/events/search.php",
                                            formdata={'fetch': 'events',
                                                      'pageNo': str(page),
                                                      'city' : 'Hyderabad',
                                                      'cat' : '0',
                                                      'area' : '0',
                                                      'fromDate': '',
                                                      'toDate' : '',
                                                      'subCat' : '0'
                                                      },
                                            callback=self.parse_events
                                            )
                page += 1
    
    
        # Parse the events listed on one page and request each event's details
        def parse_events(self, response):
            events = json.loads(response.body)
            events.pop(0)
    
            for event_details in events:
                yield FormRequest(url="http://events.justdial.com/events/search.php",
                                            formdata={'fetch': 'event',
                                                      'eventId': str(event_details["id"]),
                                                      },
                                            callback=self.parse_event
                                            )
    
    
    
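        # Parse the details of a single event
        def parse_event(self, response):
            event_details = json.loads(response.body)
            items = []
            # Placeholder: replace this with your own Item class
            # (e.g. item = Product()) and fill it from event_details
            item = event_details

            items.append(item)
            return items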
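
The form fields and the page-count step can be tried without Scrapy or network access. The following is a minimal sketch in Python 3 using only the standard library, not part of the original answer. The helper names `build_form_data` and `count_pages` are hypothetical, the page size of 10 is an assumption taken from the division by 10 in the spider, and the sample JSON is made up (only the `count` field is known from the spider code).

```python
import json
from urllib.parse import urlencode

# URL taken from the answer; everything else below is illustrative.
SEARCH_URL = "http://events.justdial.com/events/search.php"
EVENTS_PER_PAGE = 10  # assumption: implied by the "/ 10" in the spider


def build_form_data(fetch, page_no=1, city="Hyderabad"):
    """Return the POST fields listed in the answer, all as strings."""
    return {
        "city": city,
        "cat": "0",
        "area": "0",
        "fromDate": "",
        "toDate": "",
        "subCat": "0",
        "pageNo": str(page_no),
        "fetch": fetch,
    }


def count_pages(area_json):
    """Sum the per-area counts and round up to whole pages."""
    total = sum(int(area["count"]) for area in json.loads(area_json))
    return (total + EVENTS_PER_PAGE - 1) // EVENTS_PER_PAGE


if __name__ == "__main__":
    # Hypothetical fetch=area response; the real payload may carry more fields.
    sample = '[{"count": "12"}, {"count": "9"}]'
    for page in range(1, count_pages(sample) + 1):
        # Print the URL-encoded body that would be POSTed for each page
        print(SEARCH_URL, urlencode(build_form_data("events", page)))
```

Keeping the form fields in one helper avoids repeating the same dictionary in every callback, and rounding up means a total that is an exact multiple of the page size does not trigger a request for an empty page.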