
How can I scrape these Wikipedia tables with BeautifulSoup?


I'm trying to scrape all the movies and release dates on this Wikipedia page across multiple tables.

This is my code:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')
# find() returns only the first table with class "wikitable"
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    try:
        movie = data[0].i.a.text
    except IndexError:
        continue  # row has no cells, skip it
    print(movie)

However, I'm only getting the movie titles from the first 1930s–1940s table. What I'm hoping for is two columns like this:

"Snow White and the Seven Dwarfs"    "December 21, 1937"
"Pinocchio"                          "February 7, 1940"
"Fantasia"                           "November 13, 1940"

How would I get this?


Solution

  • Since you expect two columns (and potentially a dataframe), you can use read_html from pandas, which returns one DataFrame per table on the page:

    #pip install pandas
    import pandas as pd
    
    wiki_link = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"

    df = (pd.concat(pd.read_html(wiki_link), ignore_index=True)
                [["Title", "Release date"]].dropna(subset=["Title"]))
    

    Output:

    print(df)
    
                                   Title       Release date
    0    Snow White and the Seven Dwarfs  December 21, 1937
    1                          Pinocchio   February 7, 1940
    2                           Fantasia  November 13, 1940
    ..                               ...                ...
    618         Untitled Zootopia sequel                TBA
    619                   World's Best ‡                TBA
    620            Wouldn't It Be Nice ‡                TBA
    
    [600 rows x 2 columns]
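
  • If you would rather stay with BeautifulSoup, the likely reason you only see the first table is that find returns a single match, while find_all returns every matching table. Below is a minimal sketch of that idea, not a tested scraper for the live page. The function name extract_columns is hypothetical, and it assumes each table's first row holds the headers "Title" and "Release date" (the column names used in the pandas answer) and that no cell spans several rows. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_columns(html, wanted=("Title", "Release date")):
    """Collect the wanted columns from every 'wikitable' in an HTML string.

    Hypothetical helper: assumes the first row of each table is the header
    row and that data cells do not use rowspan/colspan.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # find_all yields every matching table; find would stop at the first one
    for table in soup.find_all("table", class_="wikitable"):
        rows = table.find_all("tr")
        if not rows:
            continue
        headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
        # Skip tables that do not have all of the wanted columns
        if not all(name in headers for name in wanted):
            continue
        idx = [headers.index(name) for name in wanted]
        for tr in rows[1:]:
            cells = tr.find_all(["th", "td"])
            if len(cells) <= max(idx):
                continue  # short row (e.g. a spanning note), skip it
            results.append(tuple(cells[i].get_text(" ", strip=True) for i in idx))
    return results
```

    You would call it as extract_columns(requests.get(url).text) and, if you want a dataframe, pass the resulting list of tuples to pd.DataFrame with columns=["Title", "Release date"]. On the real page some rows may share a date cell through rowspan, in which case the pandas approach above is the safer choice.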