Search code examples
pythonpandasmulti-index

Custom Aggregation Across Parallel Hierarchy Levels in a Multi-Index


I have a dataframe that is organised hierarchically. Consider this:

                                         baseval
indexlevel0 indexlevel1 indexlevel2         
L0-0        L1-0        L2-0               1
                        L2-1               1
                        L2-2              20
            L1-1        L2-0               2
                        L2-1               2
                        L2-2              10

What I need to do is create a new dataframe, that collapses the intermediate level (indexlevel1) by replacing the corresponding (indexlevel2) with a single value that is the minimum of the two levels that were once contained in indexlevel1.

Probably easier to just show what I mean - the solution to the above example would be (that is, indexlevel0, and 2 are preserved, along with the minimum basevals per-indexlevel2):

                         minbylevel
indexlevel0 indexlevel2            
L0-0        L2-0                  1
            L2-1                  1
            L2-2                 10

I have not the slightest idea of where to even start with this. All the examples of aggregation etc work from the bottom up.

Here's some test code to create the starting point dataframe.

import pandas as pd
from io import StringIO

testdata = """
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""
testinput = StringIO(testdata)

data_df = pd.read_csv(testinput, index_col=[0,1,2], header=[0]).sort_index()
print(data_df)

Solution

  • If only one column DataFrame is possible aggregate min per first and third level of MultiIndex:

    df = data_df.groupby(level=[0,2]).min()
    print (df)
                             baseval
    indexlevel0 indexlevel2         
    L0-0        L2-0               1
                L2-1               1
                L2-2              10
    

    Or if there is multiple columns use DataFrameGroupBy.idxmin with DataFrame.loc and MultiIndex.droplevel:

    df = data_df.loc[data_df.groupby(level=[0,2])['baseval'].idxmin()].droplevel(1)
    print (df)
                             baseval
    indexlevel0 indexlevel2         
    L0-0        L2-0               1
                L2-1               1
                L2-2              10