python beautifulsoup python-requests imdb

Problems retrieving information from IMDb

I'm trying to get the movie titles from an IMDb watchlist. This is my code:

import requests, bs4
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all('.lister-item-header'))

Even though '.lister-item-header' shows up in the Chrome developer console, it doesn't exist in the HTML that the requests module downloaded. I've also tried using regular expressions. What would be the best way of retrieving the titles?

Solution

find_all() expects a tag name, not a CSS selector. To select elements by class with a CSS selector, use select() instead:

import requests
import bs4

url = 'http://www.imdb.com/chart/top'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
rows = soup.select('.titleColumn > a')

for row in rows:
    print(row.text)

Alternatively, keep find_all() and pass the class through the class_ keyword:

import requests
import bs4

url = 'http://www.imdb.com/chart/top'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
rows = soup.find_all('td', class_='titleColumn')

for row in rows:
    print(row.a.text)
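
To see why the find_all('.lister-item-header') call in the question comes back empty even when the class is present, the three call styles can be compared on a small inline document. This is only a sketch: the markup below is made up for illustration and is not IMDb's real page.

```python
import bs4

# Hypothetical markup, invented for this example; not IMDb's actual HTML.
html = """
<table>
  <tr><td class="titleColumn"><a href="/title/1/">First Movie</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/2/">Second Movie</a></td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find_all() treats a plain string as a tag name, so this looks for
# elements literally named ".titleColumn" and matches nothing.
by_tag_name = soup.find_all(".titleColumn")

# select() takes a CSS selector, so the leading dot means "class".
by_selector = [a.text for a in soup.select(".titleColumn > a")]

# find_all() can still match on class through the class_ keyword.
by_class = [td.a.text for td in soup.find_all("td", class_="titleColumn")]

print(len(by_tag_name), by_selector, by_class)
```

Both working variants return the same titles; which one to use is mostly a matter of whether you prefer CSS selectors or keyword filters.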

On the watchlist page, the data is loaded from a JSON object embedded in the raw HTML, so we can parse that object and read the titles from it.

import requests
import bs4
import json

url = 'http://www.imdb.com/user/ur69187878/watchlist?ref_=wt_nv_wl_all_1'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
# rows = soup.find_all('h3', class_='list-item-header')
js_elements = soup.find_all('script')
js_text = None
search_str = 'IMDbReactInitialState.push('

for element in js_elements:
    text = element.text
    if search_str in text:
        js_text = text.strip()
        break

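if js_text is None:
    raise ValueError('No script tag contains ' + search_str)

# Assumes the stripped script ends with ');', so dropping the last
# two characters leaves only the JSON argument.
json_start = js_text.index(search_str) + len(search_str)
json_text = js_text[json_start:-2]
json_obj = json.loads(json_text)

for json_title in json_obj['titles'].values():
    print(json_title['primary']['title'])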
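
Slicing off the last two characters depends on exactly how the script ends. A slightly more tolerant sketch, assuming the same IMDbReactInitialState.push( marker, uses json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it. The function name extract_pushed_json and the sample string are made up for illustration; only the standard library is used.

```python
import json

def extract_pushed_json(script_text, marker="IMDbReactInitialState.push("):
    """Return the JSON value that follows marker, or None if marker is absent."""
    start = script_text.find(marker)
    if start == -1:
        return None
    pos = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while pos < len(script_text) and script_text[pos].isspace():
        pos += 1
    # Parse a single JSON value starting at pos; trailing text such as ");" is ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, pos)
    return obj

# Hypothetical script content shaped like the structure the answer reads.
sample = ('IMDbReactInitialState.push('
          '{"titles": {"tt001": {"primary": {"title": "Example Movie"}}}});')
data = extract_pushed_json(sample)
titles = [entry["primary"]["title"] for entry in data["titles"].values()]
print(titles)
```

Returning None when the marker is missing lets the caller decide whether that is an error, instead of failing with an exception on a None value.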

Note that this is not a general way to handle this kind of problem. If you want a general solution for pages whose data is loaded from JSON or an API, you can use other tools such as Selenium.