Search code examples
pythonpython-3.xmulti-index

Excel to Pandas with a multi-level index producing NaN


I'm using this dataset:

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/commutingtoworkbygenderukcountryandregion

Loaded thus:

commuting_data_xls = pd.ExcelFile(commuting_data_filename)
commuting_data_sheets = commuting_data_front['Table description '].dropna()
commuting_data_1 = pd.read_excel(commuting_data_xls, '1', header=4, usecols=range(1,13))
commuting_data_1.dropna().dropna(axis=1)

The resulting hierarchical index only gets the rows right where all index columns are specified.

enter image description here

How can I correct this and name the index columns?


Solution

  • Try the following steps:

    1. Open using pd.read_excel(), just the sheet and range you want.

      commuting_data_xls = pd.read_excel("commutingdata.xlsx",'1', header=4, usecols=range(1,13))

    2. Reset the multi index names.

      commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']

    Reset the index and then restrict the rows to elimiate the totals, I assume you want them gone? If not just remove the iloc step.

    commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]
    

    Remove the 'Work_Region' column as this seems superfluous.

    commuting_data_xls = commuting_data_xls.loc[:,commuting_data_xls.columns != 'Work_Region']
    

    Fill down the Gender column to replace NaN

    commuting_data_xls['Gender'].fillna(method='ffill', inpldace=True)
    

    Reset the index if it suits your purposes.

    commuting_data_xls.set_index('Gender', 'Region')