python html web-scraping beautifulsoup httplib2

Python urlopen and httplib2 both fail to return the actual HTML of the page


I am trying to read information from this page: http://movie.douban.com/subject/20645098/comments

and am using the following to find all the comment items:

comment_item = soup.find_all("div", {"id":"comment"})

However, nothing was returned, and I realized that the HTML my script reads differs from the HTML of the actual page. Below is what I have tried.

I first tried urlopen with BeautifulSoup, as follows:

comment_html = urlopen(section_url).read()
soup = BeautifulSoup(comment_html, "html.parser")

The HTML in soup is not the same as the actual HTML. Then I tried an httplib2 request, as follows:

h = httplib2.Http()
resp, content = h.request(section_url, "GET")
soup = BeautifulSoup(content, "html.parser")

And it is still the same. :(


Solution

  • Here is a working example:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'http://movie.douban.com/subject/20645098/comments'
    resp = requests.get(url)
    # Name the parser explicitly, as in the question
    b = BeautifulSoup(resp.text, 'html.parser')
    # Match on the class attribute, not the id
    comments = b.find_all('div', {'class': 'comment'})
    
    print(comments)
    

    I used the requests library here, which I would highly recommend you use as well, but it has nothing to do with your problem. The problem with your code is that you look for an id when you want a class. The method name matters only on BeautifulSoup 3, which spells it findAll; in bs4, find_all is the correct name.
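
If it is unclear whether a page marks its comment blocks with an id or a class, a quick check that needs no third-party parser is to count the matching div tags with the standard library. This is only a sketch: SAMPLE_HTML is made-up markup standing in for the real page (the actual Douban HTML may differ), and DivCounter and count_divs are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the fetched page; the real HTML may differ.
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item"><div class="comment"><p>First</p></div></div>
  <div class="comment-item"><div class="comment"><p>Second</p></div></div>
</div>
"""


class DivCounter(HTMLParser):
    """Count <div> tags whose id or class matches a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.by_id = 0
        self.by_class = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        if attrs.get("id") == self.name:
            self.by_id += 1
        # A class attribute may hold several space-separated names.
        if self.name in (attrs.get("class") or "").split():
            self.by_class += 1


def count_divs(html, name):
    """Return (matches by id, matches by class) for <div> tags in html."""
    counter = DivCounter(name)
    counter.feed(html)
    counter.close()
    return counter.by_id, counter.by_class


by_id, by_class = count_divs(SAMPLE_HTML, "comment")
print("id matches:", by_id, "class matches:", by_class)
```

On this sample the id count is zero and the class count is two, which is the same pattern that makes a search on {"id": "comment"} come back empty while a search on {"class": "comment"} succeeds. To check a real page, pass the fetched HTML string to count_divs instead of SAMPLE_HTML.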