python html pandas html-parsing wikipedia

Pandas read_html returned column with NaN values in Python


I am trying to parse the table located here using the pandas read_html function. I was able to parse the table, but the Capacity column came back as NaN, and I am not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:

import pandas as pd

wiki_url = 'Above url'  # placeholder from the original post
df1 = pd.read_html(wiki_url, index_col=0)

Solution

  • Try something like this (pass flavor='bs4', which needs the beautifulsoup4 and html5lib packages installed):

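    import pandas as pd

    # read_html returns a list of DataFrames, one per table found on the page
    tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums', header=[0], flavor='bs4')

    df = tables[0]  # take the first table in the list
    print(df.head())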
    
       Image                                 Stadium         City State  \
    0    NaN                  Aggie Memorial Stadium   Las Cruces    NM   
    1    NaN                               Alamodome  San Antonio    TX   
    2    NaN  Alaska Airlines Field at Husky Stadium      Seattle    WA   
    3    NaN                      Albertsons Stadium        Boise    ID   
    4    NaN                Allen E. Paulson Stadium   Statesboro    GA   
    
                   Team     Conference   Capacity  \
    0  New Mexico State    Independent  30,343[1]   
    1              UTSA          C-USA      65000   
    2        Washington         Pac-12  70,500[2]   
    3       Boise State  Mountain West  36,387[3]   
    4  Georgia Southern       Sun Belt      25000   
    .............................
    .............................
    

    To strip the bracketed footnote markers (such as [1]) from the Capacity column, use:

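    # regex=True is required on recent pandas versions; .*? keeps the match non-greedy
    df["Capacity"] = df["Capacity"].str.replace(r"\[.*?\]", "", regex=True)
    print(df["Capacity"].head())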
    
    0    30,343
    1     65000
    2    70,500
    3    36,387
    4     25000
    

    Hope this helps.
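
Since the goal is to use the table for further research, the cleaned Capacity strings will probably need to become numbers. Below is a minimal sketch of one way to do that. It assumes the column holds strings like the ones printed above; the helper name clean_capacity and the small sample frame are made up for illustration, so the snippet runs without fetching the page.

```python
import pandas as pd

# Sample values copied from the Capacity output shown above
# (a stand-in for the real DataFrame returned by read_html).
df = pd.DataFrame({
    "Stadium": [
        "Aggie Memorial Stadium",
        "Alamodome",
        "Alaska Airlines Field at Husky Stadium",
        "Albertsons Stadium",
        "Allen E. Paulson Stadium",
    ],
    "Capacity": ["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"],
})


def clean_capacity(col: pd.Series) -> pd.Series:
    """Hypothetical helper: strip footnote markers and commas, return integers."""
    cleaned = (
        col.astype(str)                            # guard against mixed types
        .str.replace(r"\[.*?\]", "", regex=True)   # drop markers like [1]
        .str.replace(",", "", regex=False)         # drop thousands separators
        .str.strip()
    )
    # Values that still fail to parse become missing instead of raising.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")


df["Capacity"] = clean_capacity(df["Capacity"])
print(df["Capacity"].tolist())
```

The nullable Int64 dtype is used here so that any row that fails to parse shows up as a missing value rather than forcing the whole column to float.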