Extracting Character Roles from Tom Holland's IMDb Page Using BeautifulSoup


I extracted the following data from Tom Holland's IMDb page and stored it in a variable named "movie_contents":

[<div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
 </div>, <div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
 </div>, <div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
 </div>, <div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
 </div>, <div class="filmo-row odd" id="actor-tt7395114">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
 <br/>
 Arvin Russell
 </div>, <div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
 </div>, <div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
 </div>

I'm having issues. How can I extract all the character role names ("Peter Parker / Spider-Man", "Nathan Drake", "Todd Hewitt", etc.)?


Solution

  • This script will print all roles for the actor:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.imdb.com/name/nm4043618/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
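    seen = set()  # roles already printed, used to skip duplicates
    # Each role is the first text node after the <br> inside a filmography row.
    for br in soup.select('#filmo-head-actor + div .filmo-row > br'):
        role = br.find_next(string=True).strip()
        if role not in seen:
            seen.add(role)
            print(role)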
    

    Prints:

    Peter Parker / Spider-Man
    Nathan Drake
    Todd Hewitt
    Nico Walker
    Arvin Russell
    Ian Lightfoot (voice)
    Jip (voice)
    Walter (voice)
    Samuel Insull
    Brother Diarmuid - The Novice
    Jack Fawcett
    Bradley Baker
    Thomas Nickerson
    Tom
    Gregory Cromwell
    Former Billy (Encore) (uncredited)
    Isaac
    Eddie (voice)
    Boy
    Lucas
    Shô (UK version, voice)
    

    EDIT: To get the roles into a DataFrame, you can do this:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = "https://www.imdb.com/name/nm4043618/"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
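    seen = set()    # roles already collected, used to skip duplicates
    all_data = []   # unique roles in page order
    # Each role is the first text node after the <br> inside a filmography row.
    for br in soup.select("#filmo-head-actor + div .filmo-row > br"):
        role = br.find_next(string=True).strip()
        if role not in seen:
            seen.add(role)
            all_data.append(role)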
    
    df = pd.DataFrame(all_data, columns=["Role"])
    print(df)
    

    Prints:

                                      Role
    0            Peter Parker / Spider-Man
    1                         Nathan Drake
    2                          Todd Hewitt
    3                          Nico Walker
    4                        Arvin Russell
    5                Ian Lightfoot (voice)
    6                          Jip (voice)
    7                       Walter (voice)
    8                        Samuel Insull
    9        Brother Diarmuid - The Novice
    10                        Jack Fawcett
    11                       Bradley Baker
    12                    Thomas Nickerson
    13                                 Tom
    14                    Gregory Cromwell
    15  Former Billy (Encore) (uncredited)
    16                               Isaac
    17                       Eddie (voice)
    18                                 Boy
    19                               Lucas
    20             Shô (UK version, voice)
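
If the markup is already saved, as with "movie_contents" in the question, the same idea can be applied without a network request. The following is a minimal sketch, assuming each row looks like the ones shown in the question: a "year_column" span, a title link inside <b>, and the role as the first text after <br/>. The names SAMPLE_HTML and extract_credits are hypothetical and used only for illustration; the sample contains two rows copied from the question.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's markup; they stand in for a live page.
SAMPLE_HTML = """
<div class="filmo-row odd" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
</div>
<div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
</div>
"""


def extract_credits(html):
    """Return one dict per filmo-row with its year, title and role."""
    soup = BeautifulSoup(html, "html.parser")
    credits = []
    for row in soup.select("div.filmo-row"):
        br = row.find("br")
        # The role is assumed to be the first text node after the <br/> tag.
        role_node = br.find_next(string=True) if br else None
        credits.append({
            "year": row.select_one("span.year_column").get_text(strip=True),
            "title": row.select_one("b > a").get_text(strip=True),
            "role": role_node.strip() if role_node else "",
        })
    return credits


credits = extract_credits(SAMPLE_HTML)
for credit in credits:
    print(credit)
```

Because each entry is a dict, the resulting list can be passed to pd.DataFrame(credits) to get "year", "title" and "role" columns rather than a single "Role" column.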