
Using Python to take advantage of web page functions


I am trying to understand how this web site works. There is an input form where you can provide a URL, and the form returns information retrieved from another site (YouTube). So:

  1. My first and more interesting question is: does anybody have any idea how this site retrieves the entire corpus of comments?

  2. Alternatively, for now I am using the following code:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import json
    
    videoId = 'ZSzeFFsKEt4'  # example video id (as in the answer below)
    npage = 1
    
    # Request one page of comments for the given video id
    urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php?v=' + videoId + '&page=' + str(npage)
    url = urllib2.urlopen(urlstr)
    content = url.read()
    soup = BeautifulSoup(content)
    
    # The body of the response is JSON
    newDictionary = json.loads(str(soup))
    
    # Print an example field
    print newDictionary['list'][1]['username']
    

    However, I cannot iterate over all pages (which does not happen when I do it manually). I have placed time.sleep(30) after the JSON parsing, but without success. Why is that happening?

Thanks!

Python 2.7.8


Solution

    1. Probably by using the Google YouTube Data API. Note that (presently) comments can only be retrieved using version 2 of the API, which has been deprecated; there is apparently no support for comments in v3 yet. Python client libraries are available, see https://developers.google.com/youtube/code#Python. A sketch follows.
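
    A minimal sketch, assuming the gdata-python-client package is installed; the feed and entry fields follow the standard Atom layout, and the video id is the example one from the code below:

      import gdata.youtube.service
      
      # Fetch the comment feed for one video via the (deprecated) v2 API
      service = gdata.youtube.service.YouTubeService()
      feed = service.GetYouTubeVideoCommentFeed(video_id='ZSzeFFsKEt4')
      
      for comment in feed.entry:
          # Author name and comment text live in the standard Atom fields
          print comment.author[0].name.text, '-', comment.content.text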

    2. The response is already JSON, so there is no need for BeautifulSoup. The web server seems to require cookies, so I recommend the requests module, in particular its session management (which persists cookies across requests):

      import requests
      
      videoId = 'ZSzeFFsKEt4'
      results = []
      npage = 1
      urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php'
      
      # A session keeps cookies between requests, which this server needs
      session = requests.Session()
      while True:
          print "Getting page ", npage
          response = session.get(urlstr, params={'v': videoId, 'page': npage})
          content = response.json()
          # An (almost) empty 'list' means we have run out of pages
          if len(content['list']) > 1:
              results.append(content)
          else:
              break
          npage += 1
      
      print results
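
    Each element of results is one decoded page, so the username lookup from the question becomes, for example (assuming the first page holds at least two entries):

      print results[0]['list'][1]['username']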