python, python-2.7, urllib2

Contents are missing in urllib2.urlopen()


I am parsing a web page by sending a request like this:

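import urllib2
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; the original import was not shown

# urllink holds the URL of the page to fetch
request = urllib2.Request(urllink, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
print request
urlfile = urllib2.urlopen(request)
page = urlfile.read()
soup = BeautifulSoup(page)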

The problem is that some of the page's content is missing from the response returned by urllib2.urlopen(). If I save the page from the browser, I get all of the content. I have noticed that the page makes one more request through an AJAX call. Is there any way in Python to get the whole page by sending a request?


Solution

  • AJAX stands for Asynchronous JavaScript and XML. It means that you GET the document, and after it loads in the browser, some content is dynamically downloaded and injected into the DOM.

    What does this mean for you? The response contains all the information needed to build the full document, but you probably have no way to execute the JS that downloads and injects the dynamic data.

    How to get around this? I haven't found a JS engine for Python yet, but I'm still searching. Instead, you can drive a real browser engine with Selenium (a library that communicates with a browser installed on your computer and lets you simulate user actions such as clicks and text input). You can then inspect the DOM after those actions and perform further actions.

    Another way is to use Jython (as you're using Python 2.7, it should be compatible) and take advantage of Rhino, or any other JS engine for Java, to execute this code.
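
The Selenium route might look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names fetch_rendered_html and demo, the placeholder URL, and the fixed two-second pause are all made up for this example, and Firefox with its driver is assumed to be installed. The sleep is a crude stand-in for a proper wait on the element that the AJAX call fills in.

```python
import time


def fetch_rendered_html(driver, url, wait_seconds=2.0):
    """Load url in a browser and return the DOM after its scripts have run.

    driver is any Selenium WebDriver instance (hypothetical helper, not part
    of Selenium itself).
    """
    driver.get(url)            # the browser fetches the page and runs its JS
    time.sleep(wait_seconds)   # crude pause so pending AJAX calls can finish
    return driver.page_source  # serialized DOM, including injected content


def demo():
    # Imported here so the helper above can be loaded without Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver exist
    try:
        html = fetch_rendered_html(driver, "http://example.com/")  # placeholder
        print(len(html))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling demo() opens a browser window, loads the page, and prints the length of the rendered HTML. The returned string can be passed to BeautifulSoup in the same way as the result of urlfile.read() in the question.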