python, beautifulsoup, python-requests, imdb

Problems retrieving information from IMDb


I'm trying to get the movie titles from an IMDb watchlist. This is my code:

import requests, bs4
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all('.lister-item-header'))

Even though '.lister-item-header' exists in the Chrome developer console, it doesn't exist in the HTML that the requests module downloaded. I've also tried using regular expressions. What would be the best way to retrieve the titles?


Solution

  • find_all() expects a tag name, not a CSS selector. To select elements by their class with a CSS selector, use select():

    import requests
    import bs4
    
    url = 'http://www.imdb.com/chart/top'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    rows = soup.select('.titleColumn > a')
    
    for row in rows:
        print(row.text)
    

    Alternatively, use find_all() with the class_ keyword argument:

    import requests
    import bs4
    
    url = 'http://www.imdb.com/chart/top'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    rows = soup.find_all('td', class_='titleColumn')
    
    for row in rows:
        print(row.a.text)
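
The difference between the two calls can be checked offline. The sketch below runs them against a small inline HTML fragment; the fragment and the movie names are made up for illustration and only mimic the titleColumn structure used above.

```python
import bs4

# Hypothetical HTML fragment that mimics the structure used in the examples.
HTML = """
<table>
  <tr><td class="titleColumn"><a href="/title/1/">First Movie</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/2/">Second Movie</a></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# find_all() treats a plain string as a tag name, so this searches for
# tags literally named ".titleColumn" and matches nothing.
as_tag_name = soup.find_all(".titleColumn")

# select() takes a CSS selector, so the class selector works here.
via_select = [a.text for a in soup.select(".titleColumn > a")]

# find_all() can filter by class through the class_ keyword.
via_find_all = [td.a.text for td in soup.find_all("td", class_="titleColumn")]

print(as_tag_name)
print(via_select)
print(via_find_all)
```

Passing '.lister-item-header' to find_all() therefore returns an empty list even when elements with that class are present in the downloaded HTML.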
    

    The watchlist page is different: its data is loaded from a JSON object embedded in the raw HTML, so we can parse that object to get the titles.


    import json

    import bs4
    import requests

    url = 'http://www.imdb.com/user/ur69187878/watchlist?ref_=wt_nv_wl_all_1'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # The watchlist data is passed to this call inside one of the <script> tags.
    search_str = 'IMDbReactInitialState.push('
    js_text = None

    for element in soup.find_all('script'):
        text = element.text
        if search_str in text:
            js_text = text.strip()
            break

    if js_text is None:
        raise ValueError('Embedded watchlist JSON not found in the page')

    # Cut out the JSON argument: skip the call prefix and drop the trailing ");".
    json_start = js_text.index(search_str) + len(search_str)
    json_obj = json.loads(js_text[json_start:-2])

    # 'titles' maps each title id to its record.
    for json_title in json_obj['titles'].values():
        print(json_title['primary']['title'])
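
Slicing off the last two characters assumes the script text ends exactly with ");". A slightly more tolerant variant, sketched below with only the standard library, lets json.JSONDecoder.raw_decode stop at the end of the JSON value. The function name extract_pushed_json and the sample string (including the title id) are hypothetical and are not taken from the real page.

```python
import json

MARKER = "IMDbReactInitialState.push("


def extract_pushed_json(script_text, marker=MARKER):
    """Parse the JSON value that follows `marker` in `script_text`."""
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever comes after it, so the trailing ");" needs no special handling.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Hypothetical script content shaped like the embedded watchlist data.
sample = (
    'IMDbReactInitialState.push( {"titles": {"tt0000001": '
    '{"primary": {"title": "Example"}}}});\n'
)
data = extract_pushed_json(sample)
titles = [record["primary"]["title"] for record in data["titles"].values()]
print(titles)
```

The same function could be given the js_text found in the loop above in place of the sample string.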
    

    That said, this is not a general method for this kind of problem. If you want a general solution for pages whose data is loaded from JSON or an API, you can use other tools such as Selenium.
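
As a rough sketch of the Selenium route, assuming Selenium 4 with a local Chrome installation: the browser runs the page's JavaScript, and the titles are then read from the rendered DOM. The selector '.lister-item-header' comes from the question and may not match the current page; the function names and the timeout are illustrative choices.

```python
def titles_from_driver(driver, selector=".lister-item-header"):
    """Return the non-empty visible text of every element matching `selector`."""
    # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", selector)
    return [el.text.strip() for el in elements if el.text.strip()]


def fetch_rendered_titles(url, selector=".lister-item-header", timeout=10):
    """Open `url` in Chrome, wait for the page to render, and return the titles."""
    # Imported here so titles_from_driver() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one matching element exists in the rendered DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return titles_from_driver(driver, selector)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling fetch_rendered_titles(url) with the watchlist URL would return the list of titles if the selector matches. This is slower than parsing the embedded JSON, since it starts a real browser.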