python, python-3.x, pandas, dataframe, wikipedia

How to scrape data from a Wikipedia list into a pandas DataFrame


I'm trying to scrape a list (not a table) from a Wikipedia page, but the code below raises "IndexError: list index out of range". How can I fix this?

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://it.m.wikipedia.org/wiki/Premio_Bagutta'
data = requests.get(url)
soup= BeautifulSoup(data.content, "html.parser")
raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")

df = pd.DataFrame([[item.get_text().split(" ")[0],
                    item.find_next("a").get("title"),
                    item.find_next("i").get_text()[1:-1]]
                   for item in raw if item.find_next("i")],
                  columns=("Year"))
print(df.head())

Solution

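  • You could try this. The "list index out of range" error most likely comes from the [0] applied to soup.find_all("div", {"class": "div-col"}), which fails when no matching div is found in the fetched HTML. The code below instead selects the page section by its class, then reads the years from the p tags and the authors from the ul tags: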

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    data = requests.get("https://it.m.wikipedia.org/wiki/Premio_Bagutta")
    raw = BeautifulSoup(data.content, "html.parser").find_all(
        "section", class_="mf-section-2 collapsible-block"
    )[0]
    
    raw_years = [item.text.replace("\n", "") for item in raw.find_all("p")]
    raw_authors = raw.find_all("ul")
    
    # For some years, there are several authors, so you have to iterate in sync
    years = []
    authors = []
    for (year, author) in zip(raw_years, raw_authors):
        years.append(year)
        authors.append(author.text.split("\n"))
    
    df = pd.DataFrame({"year": years, "author": authors}).explode("author")
    
    print(df)
    # Output
        year                                                                author
    0   1927  Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)
    1   1928                      Giovanni Comisso, Gente di mare[13][14] (Treves)
    2   1929              Vincenzo Cardarelli, Il sole a picco[15][16] (Mondadori)
    3   1930                Gino Rocca, Gli ultimi furono i primi[17][18] (Treves)
    4   1931              Giovanni Titta Rosa, Il varco nel muro[19][20] (Carabba)
    ..   ...                                                                   ...
    82  2018                Helena Janeczek, La ragazza con la Leica[154] (Guanda)
    83  2019                            Marco Balzano, Resto qui[9][155] (Einaudi)
    84  2020                        Enrico Deaglio, La bomba[8][156] (Feltrinelli)
    85  2021                         Giorgio Fontana, Prima di noi[157] (Sellerio)
    86  2022                    Benedetta Craveri, La contessa[158][159] (Adelphi)
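
The author column in the output above still mixes the author, the title, footnote markers such as [13][14], and the publisher in one string. If separate fields are wanted, a minimal sketch along these lines could split them. It assumes every entry follows the shape "author, title[n] (publisher)" seen in the printed rows; parse_entry is a hypothetical helper, not part of pandas or BeautifulSoup, and entries that do not fit the assumed shape are returned with only the cleaned text.

```python
import re

# Assumed entry shape, inferred from the printed rows above:
#   "<author>, <title>[n][m] (<publisher>)"
FOOTNOTE = re.compile(r"\[\d+\]")
ENTRY = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<title>.+?)\s*\((?P<publisher>[^()]*)\)\s*$"
)


def parse_entry(text):
    """Split one list entry into author, title and publisher (hypothetical helper)."""
    # Drop footnote markers such as "[13]" before matching.
    cleaned = FOOTNOTE.sub("", text).strip()
    match = ENTRY.match(cleaned)
    if match is None:
        # The entry does not follow the assumed shape: keep the cleaned text only.
        return {"author": cleaned, "title": None, "publisher": None}
    return {key: value.strip() for key, value in match.groupdict().items()}


sample = "Giovanni Comisso, Gente di mare[13][14] (Treves)"
print(parse_entry(sample))
```

With the DataFrame from the answer, something like pd.DataFrame(df["author"].map(parse_entry).tolist()) should give one column per field, provided the scraped strings really match the assumed shape.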