
Using Python to take advantage of web page functions


I am trying to understand how this web site works. There is an input form where you can provide a URL, and the form returns information retrieved from another site (YouTube). So:

  1. My first and more interesting question is: does anybody have any idea how this site retrieves the entire corpus of comments?

  2. Alternatively, for now I am using the following code:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import json
    
    videoId = 'ZSzeFFsKEt4'  # example video id (as in the answer below)
    npage = 1
    
    # Request one page of comments for the given video id
    urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php?v=' + videoId + '&page=' + str(npage)
    url = urllib2.urlopen(urlstr)
    content = url.read()
    soup = BeautifulSoup(content)
    
    # The body of the response is JSON
    newDictionary = json.loads(str(soup))
    
    # Print an example field
    print newDictionary['list'][1]['username']
    

    However, I cannot iterate over all pages (which does not happen when I do it manually). I have placed time.sleep(30) after the JSON parsing, but without success. Why is that happening?

Thanks!

Python 2.7.8


Solution

    1. Probably by using the Google YouTube Data API. Note that (presently) comments can only be retrieved using version 2 of the API, which has been deprecated; there is apparently no support for comments in v3 yet. Python client libraries are available, see https://developers.google.com/youtube/code#Python. A sketch follows.
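
    A minimal sketch, assuming the gdata-python-client package is installed; the feed and entry fields follow the standard Atom layout, and the video id is the example one from the code below:

      import gdata.youtube.service
      
      # Fetch the comment feed for one video via the (deprecated) v2 API
      service = gdata.youtube.service.YouTubeService()
      feed = service.GetYouTubeVideoCommentFeed(video_id='ZSzeFFsKEt4')
      
      for comment in feed.entry:
          # Author name and comment text live in the standard Atom fields
          print comment.author[0].name.text, '-', comment.content.text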

    2. The response is already JSON, so there is no need for BeautifulSoup. The web server seems to require cookies, so I recommend the requests module, in particular its session management (which persists cookies across requests):

      import requests
      
      videoId = 'ZSzeFFsKEt4'
      results = []
      npage = 1
      urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php'
      
      # A session keeps cookies between requests, which this server needs
      session = requests.Session()
      while True:
          print "Getting page ", npage
          response = session.get(urlstr, params={'v': videoId, 'page': npage})
          content = response.json()
          # An (almost) empty 'list' means we have run out of pages
          if len(content['list']) > 1:
              results.append(content)
          else:
              break
          npage += 1
      
      print results
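
    Each element of results is one decoded page, so the username lookup from the question becomes, for example (assuming the first page holds at least two entries):

      print results[0]['list'][1]['username']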