Search code examples
pythonpandasdata-analysis

Pandas, Excel-Import and MultiIndex


I am new to pandas and currently trying to make some analysis of Excel-data in the following Schema:

enter image description here

My goal is a visualisation with the index-labels XYZ, CDE, EFG, HU on the x-axis and the corresponing Perc-values of Yes, ProbYes, X, ProbNo, No stacked on the y-axis.

Currently I'm parsing the Excel-data into a panda DataFrame via the code:

import pandas as pd
path = 'x1.xlsx'
x = pd.ExcelFile(path)
sheets = x.sheet_names
table = x.parse(sheets[0], header=2) # take line 2 as column-names

The generated MultiIndex of table seems to be fine:

>>> table.index 
MultiIndex(levels=[[u'Individual', u'Summary'], [u'ABC', u'CDE', u'EFG', u'HIJ'], [u'Abs', u'Perc']], labels=[[0, -1, -1, -1, -1, -1, -1, -1, 1, -1], [0, -1, 1, -1, 2, -1, 3, -1, -1, -1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

However, it's neither possible to access a specific row:

>>> table.ix[('Individual', 'CDE')]
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (0)'

... nor is it possible to access or filter / remove a column containing the row indices

>>> table.index.names
FrozenList([None, None, None])

i.e. the indices have no names and so I guess I cannot access them?

I tried to simplify the data structure via the alternative parse:

>>> table2 = x.parse(sheets[0], header=2, skiprows=2, parse_cols='B,:I')

that, however, didnt really help.

Edit:

It doenst help to sort:

>>> table.sort(inplace=True)
>>> table[:4]

yields:

enter image description here

and with that table I cannot do my analysis...


Solution

  • I think after doing

    table.sort(inplace=True)
    

    you may be able to access

    table.ix[('Individual', 'CDE')]
    

    EDIT:

    I know why -- Your excel file has merged cells such A4:A11. When you load it into pandas DataFrame, the index Individual is only in A4 while indices in A5:A11 are all nan

    One work-around I can think of is:

    table =table.reset_index().fillna(method='ffill').set_index(['level_0','level_1','level_2'])
    #reset_index() automatically gives column names level_?
    

    Then you are good to go:

    table.ix[('Individual','CDE')]