Pandas groupby produces unexpected result for DataFrame with MultiIndex


I have a pandas.DataFrame with a MultiIndex that has 3 levels. I would like to group by the first 2 levels and, within each group, keep only the row with the largest value in the 3rd level. groupby followed by apply produces an unexpected result. What is the cause, and is there a parameter I should add here?

import numpy as np
import pandas as pd

idxslc = pd.IndexSlice

def keep_latest_info(df):
    print('Before:')
    print(df)
    if len(df) == 1:
        return df

    # Largest value on the last index level within this group
    max_dt = df.index.get_level_values(-1).max()
    df = df.loc[idxslc[:, :, max_dt]]
    print('After:')
    print(df)
    return df

df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
df.index = df.index.rename('a-b-c'.split('-'))
df.groupby(level=[0, 1], group_keys=True, as_index=True).apply(keep_latest_info)

The results are as follows. As can be seen, the output for the 2nd group (the 2nd and 3rd rows) no longer has the 3rd index level. Why is this?

Before:
       3
a b c   
1 4 5  3
Before:
       3
a b c   
2 9 6  9
    7  1
After:
     3
a b   
2 9  1


AssertionError: Cannot concat indices that do not have the same number of levels

Edit: The result is caused by how IndexSlice is used. With df = df.loc[idxslc[:, :, [max_dt]]] in the function instead (max_dt wrapped in square brackets), everything works fine. Why does IndexSlice behave this way?


Solution

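  • This is due to how the slicing is performed within the keep_latest_info function. In df.loc[idxslc[:, :, max_dt]], the scalar max_dt selects a single label on the third level, and pandas drops a level that has been reduced to a single label, so the result for that group has only two index levels. The single-row group is returned unchanged with all three levels, so the pieces no longer have the same number of levels and cannot be concatenated, hence the AssertionError.

    To resolve this issue, you need to maintain the MultiIndex structure throughout the function. You can achieve this by modifying the keep_latest_info function as follows: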

    import pandas as pd
    import numpy as np
    
    idxslc = pd.IndexSlice
    
    def keep_latest_info(df):
        print('Before:')
        print(df)
        
        if len(df) == 1:
            return df
    
        max_dt = df.index.get_level_values(-1).max()
        # Index rows with the IndexSlice and pass ':' for the columns explicitly.
        df = df.loc[idxslc[:, :, max_dt], :]
        print('After:')
        print(df)
        return df 
    
    df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
    df.index = df.index.rename('a-b-c'.split('-'))
    
    result = df.groupby(level=[0, 1], group_keys=True, as_index=False).apply(keep_latest_info)
    

    Changes made:

    1. Added an explicit column indexer (the trailing , :) after idxslc[:,:,max_dt], so that the IndexSlice is read as a row indexer only, to maintain all levels of the MultiIndex.

    2. Changed as_index=True to as_index=False in the groupby function call, so that the group keys a and b are not prepended a second time to the index returned by the function.

    With these changes, the function retains the MultiIndex structure throughout the operation, and you should get the expected output without any errors.
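
As a minimal sketch of the variant mentioned in the question's edit, the version below passes the maximum as a one-element list, [max_c], together with an explicit column indexer. A list-like selector keeps the level it selects on, so every group comes back with all three levels. The function name keep_latest_rows, the column name value, and the choice of group_keys=False are illustrative assumptions, not part of the original code.

```python
import pandas as pd

idx = pd.IndexSlice

# Same rows as in the question, built explicitly.
# The column name "value" is an illustrative assumption.
index = pd.MultiIndex.from_tuples(
    [(1, 4, 5), (2, 9, 6), (2, 9, 7)], names=["a", "b", "c"]
)
df = pd.DataFrame({"value": [3, 9, 1]}, index=index)


def keep_latest_rows(group):
    """Keep the rows whose last index level equals that level's maximum."""
    max_c = group.index.get_level_values(-1).max()
    # [max_c] is a list, so level "c" is kept in the result;
    # the trailing ":" selects all columns explicitly.
    return group.loc[idx[:, :, [max_c]], :]


# group_keys=False avoids prepending the group keys to the returned index.
result = df.groupby(level=["a", "b"], group_keys=False).apply(keep_latest_rows)
print(result)
```

Because each group keeps the same three-level index, the per-group results can be concatenated without the level mismatch seen in the question.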