I am scraping IMDb for a film's full cast and crew (the IMDb API doesn't have comprehensive cast/credits data). The end result I want is a three-column table that collects the data from all the tables on the page and arranges it like this:
Produced by | Gary Kurtz | producer
Produced by | George Lucas | executive producer
Music by | John Williams |
(using Star Wars as an example: http://www.imdb.com/title/tt0076759/fullcredits?ref_=tt_cl_sm#cast)
The following code is almost there, but the output contains a ton of unnecessary whitespace, and I am surely misusing the .parent attribute by chaining it like this. What is the best way to find the text of the h4 above a table?
Here's the code.
with open(fname, 'r') as f:
    soup = BeautifulSoup(f.read(),'html5lib')
    soup.prettify()

for child in soup.find_all('td',{'class':'name'}):
    print child.parent.text, child.parent.parent.parent.parent.parent.parent.text.encode('utf-8')
I'm trying to get values such as "Directed by" from these h4 headers.
Welcome to Stack Overflow. It seems that each h4 and its table appear as a pair in the HTML, so you can find both at the same time and zip them to loop over them together. After that, you just extract and format the text. Change your code to:
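from bs4 import BeautifulSoup
with open(fname, 'r') as f:
    soup = BeautifulSoup(f.read(), 'html5lib')
# Pair each h4 with a table (assumes headers and tables occur in matching order).
for h4, table in zip(soup.find_all('h4'), soup.find_all('table')):
    header4 = " ".join(h4.text.strip().split())  # collapse runs of whitespace
    # One string per row; the "..." between name and role becomes "|".
    table_data = [" ".join(tr.text.strip().replace("\n", "").replace("...", "|").split()) for tr in table.find_all('tr')]
    print("%s | %s \n" % (header4, table_data))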
This will print:
Directed by | [u'George Lucas']
Writing Credits | [u'George Lucas | (written by)']
Cast (in credits order) verified as complete | ['', u'Mark Hamill | Luke Skywalker', u'Harrison Ford | Han Solo', u'Carrie Fisher | Princess Leia Organa', u'Peter Cushing | Grand Moff Tarkin',...]
Produced by | [u'Gary Kurtz | producer', u'George Lucas | executive producer', u'Rick McCallum | producer (1997 special version)']
Music by | [u'John Williams']
...
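To get from this output to the three-column layout in the question, one option is to emit one row per table entry instead of one list per section. The sketch below is only an illustration under a few assumptions: `credit_rows` is a hypothetical helper name, it takes the h4 text and the raw text of each `tr`, and it assumes the name and role are separated by "..." as in the rows above. The sample strings are written by hand to mimic the page, not scraped.

```python
def credit_rows(header, row_texts, sep="..."):
    """Turn one section's raw row texts into (section, name, role) tuples."""
    section = " ".join(header.split())  # collapse whitespace in the h4 text
    rows = []
    for text in row_texts:
        # Collapse newlines and runs of spaces into single spaces.
        clean = " ".join(text.split())
        if not clean:
            continue  # skip blank rows, such as the empty first row of the cast table
        # Split once on the separator; rows without one get an empty role.
        name, _, role = clean.partition(sep)
        rows.append((section, name.strip(), role.strip()))
    return rows


# Hand-written sample input modelled on the Star Wars page (an assumption, not scraped data).
sample = credit_rows("Produced  by\n", ["\n Gary Kurtz \n ... \n producer ",
                                         "George Lucas ... executive producer"])
sample += credit_rows("Music by", ["", " John Williams\n"])

for row in sample:
    print(" | ".join(row))
```

Inside the loop from the answer, you would probably call it as `credit_rows(h4.text, [tr.text for tr in table.find_all('tr')])` and print or collect the resulting tuples.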