Beautiful Soup Headers with table data

I am scraping IMDb for a film's cast and crew credits (the IMDb API doesn't have comprehensive cast/credits data). The final product I want is a three-column table that collects the data from all the tables on the page, with rows like this:

Produced by | Gary Kurtz | producer 

Produced by | George Lucas | executive producer

Music by    | John Williams | 

(using Star Wars as an example).

The following code is almost there, but the output contains a lot of unnecessary whitespace, and I am surely using .parent the wrong way. What is the best way to find the text of the h4 above a table?

Here's the code.

    from bs4 import BeautifulSoup

    with open(fname, 'r') as f:
        soup = BeautifulSoup(f, 'html5lib')

        for child in soup.find_all('td', {'class': 'name'}):
            # Python 2 print statement; climbs six parents to reach the enclosing block
            print child.parent.text, child.parent.parent.parent.parent.parent.parent.text.encode('utf-8')

I'm trying to get values such as "Directed by" from these h4 headers.


  • Welcome to Stack Overflow. The h4 headers and the tables appear as pairs in the HTML, so you can find both at the same time and zip them together to loop over them in a single for loop. After that you just extract and format the text. Change your code to:

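    soup = BeautifulSoup(f, 'html5lib')
    # Pair each h4 header with the table that follows it, by position
    for h4, table in zip(soup.find_all('h4'), soup.find_all('table')):
        # Collapse all runs of whitespace in the header into single spaces
        header4 = " ".join(h4.text.strip().split())
        # One cleaned string per row; the "..." separator between name and role becomes "|"
        table_data = [" ".join(tr.text.strip().replace("\n", "").replace("...", "|").split()) for tr in table.find_all('tr')]
        print("%s | %s \n" % (header4, table_data))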

    This will print:

    Directed by | [u'George Lucas'] 
    Writing Credits | [u'George Lucas | (written by)'] 
    Cast (in credits order) verified as complete | ['', u'Mark Hamill | Luke Skywalker', u'Harrison Ford | Han Solo', u'Carrie Fisher | Princess Leia Organa', u'Peter Cushing | Grand Moff Tarkin',...]
    Produced by | [u'Gary Kurtz | producer', u'George Lucas | executive producer', u'Rick McCallum | producer (1997 special version)'] 
    Music by | [u'John Williams'] 
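
One caveat with zip: it pairs headers and tables purely by position, so if the page ever has an h4 with no table after it (or the reverse), the headers will drift out of step. A possible alternative, sketched here under some assumptions, is to start from each name cell and look backwards for the nearest h4 with find_previous, which also gives one row per person as in the question. The sample markup and the td class "credit" are stand-ins made up for illustration, not the real IMDb page, and html.parser is used only so the sketch needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real credits page.
SAMPLE_HTML = """
<h4>Produced by</h4>
<table>
  <tr><td class="name"><a>Gary Kurtz</a></td><td>...</td><td class="credit">producer</td></tr>
  <tr><td class="name"><a>George Lucas</a></td><td>...</td><td class="credit">executive producer</td></tr>
</table>
<h4>Music by</h4>
<table>
  <tr><td class="name"><a>John Williams</a></td></tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def credit_rows(html):
    """Return (header, name, role) tuples, one per td.name cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for name_cell in soup.find_all("td", class_="name"):
        # Nearest h4 that appears before this cell in document order.
        header = name_cell.find_previous("h4")
        # Assumption: the role, if present, sits in a td.credit in the same row.
        credit_cell = name_cell.find_parent("tr").find("td", class_="credit")
        rows.append((
            clean(header.get_text()) if header else "",
            clean(name_cell.get_text()),
            clean(credit_cell.get_text()) if credit_cell else "",
        ))
    return rows


for row in credit_rows(SAMPLE_HTML):
    print(" | ".join(row))
```

Because each row looks up its own header, a missing table or an extra h4 only affects the rows next to it rather than shifting every pair that follows.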