I am trying to read information from this page: http://movie.douban.com/subject/20645098/comments
and use the following to find all the comment items.
comment_item = soup.find_all("div", {"id":"comment"})
However, nothing is returned, and I realized that the HTML my script reads is different from the HTML on the actual page. Below is what I have tried.
I first tried fetching the page with urlopen and parsing it with BeautifulSoup:
comment_html = urlopen(section_url).read()
soup = BeautifulSoup(comment_html, "html.parser")
The HTML that soup returns is not the same as the actual HTML. Then I tried an httplib2 request, as follows:
h = httplib2.Http()
resp, content = h.request(section_url, "GET")
soup = BeautifulSoup(content, "html.parser")
And it is still the same. :(
Here is a working example:
import requests
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3: import the class, not the module
url = 'http://movie.douban.com/subject/20645098/comments'
resp = requests.get(url)
b = BeautifulSoup(resp.text)
comments = b.findAll('div', {'class': 'comment'})
print comments
I used the requests library here, which I would highly recommend, but it has nothing to do with your problem. The problems with your code are the method name (find_all exists only in bs4; in BeautifulSoup 3, which I used here, it is findAll) and that you want to look for a class and not for an id.
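To see why the id lookup comes back empty while the class lookup matches, here is a minimal sketch that needs no network access and no third-party packages. It swaps in the standard library's html.parser in place of BeautifulSoup, and SAMPLE_HTML, DivCounter and count_divs are hypothetical names made up for illustration; the markup is an assumed stand-in, not the real page source.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the downloaded page (an assumption).
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item"><div class="comment"><p>Great film</p></div></div>
  <div class="comment-item"><div class="comment"><p>Too long</p></div></div>
</div>
"""


class DivCounter(HTMLParser):
    """Counts <div> start tags whose given attribute contains the given token."""

    def __init__(self, attr_name, attr_value):
        super().__init__()
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; class may hold several tokens.
        tokens = (dict(attrs).get(self.attr_name) or "").split()
        if self.attr_value in tokens:
            self.count += 1


def count_divs(html, attr_name, attr_value):
    """Returns how many <div> tags carry attr_value in the attribute attr_name."""
    parser = DivCounter(attr_name, attr_value)
    parser.feed(html)
    parser.close()
    return parser.count


by_id = count_divs(SAMPLE_HTML, "id", "comment")        # what the question searched for
by_class = count_divs(SAMPLE_HTML, "class", "comment")  # what the answer searches for
print(by_id, by_class)
```

With bs4 on Python 3, the equivalent class lookup is soup.find_all("div", class_="comment").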