I have a pandas.DataFrame with a MultiIndex that has 3 levels. I would like to group by the first 2 levels and keep only the largest value in the 3rd level. pandas.groupby produces an unexpected result. What is the cause, and are there any parameters I should add here?
import numpy as np
import pandas as pd

idxslc = pd.IndexSlice

def keep_latest_info(df):
    print('Before:')
    print(df)
    if len(df) == 1:
        return df
    max_dt = df.index.get_level_values(-1).max()
    df = df.loc[idxslc[:, :, max_dt]]
    print('After:')
    print(df)
    return df

df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
df.index = df.index.rename('a-b-c'.split('-'))
df.groupby(level=[0, 1], group_keys=True, as_index=True).apply(keep_latest_info)
The results are as follows. As can be seen, the output for the 2nd group (2nd and 3rd rows) no longer has the 3rd index level. Why is this?
Before:
       3
a b c
1 4 5  3
Before:
       3
a b c
2 9 6  9
    7  1
After:
     3
a b
2 9  1
AssertionError: Cannot concat indices that do not have the same number of levels
Edit:
The result is caused by IndexSlice. If I use df = df.loc[idxslc[:, :, [max_dt]]] (with square brackets) in the function instead, everything works fine. Why does IndexSlice behave this way?
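This is due to how the slicing is performed inside the keep_latest_info function. When you select with df.loc[idxslc[:, :, max_dt]], max_dt is a scalar label and no column indexer is given, so pandas treats the selection like a cross-section (similar to DataFrame.xs) and drops the third level from the result's index. The single-row group returns early with all three levels, so the group results end up with different numbers of index levels and cannot be concatenated. Passing a list, [max_dt], keeps the level, which is why the version in your edit works.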
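To see the difference outside of groupby, here is a small check that reuses the frame from the question. The variable names are only illustrative, and the level names noted in the comments are what I would expect from recent pandas versions rather than a guarantee for every release.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Same frame as in the question: index levels a, b, c and one data column named 3.
df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
df.index = df.index.rename(['a', 'b', 'c'])

# Scalar label on level c, no column indexer: level c is dropped.
scalar_rows_only = df.loc[idx[:, :, 7]]
# List-like label on level c, no column indexer: level c is kept.
list_rows_only = df.loc[idx[:, :, [7]]]
# Scalar label on level c plus an explicit column indexer: level c is kept.
scalar_both_axes = df.loc[idx[:, :, 7], :]

print(list(scalar_rows_only.index.names))   # expected: ['a', 'b']
print(list(list_rows_only.index.names))     # expected: ['a', 'b', 'c']
print(list(scalar_both_axes.index.names))   # expected: ['a', 'b', 'c']
```

Either of the last two forms returns a frame with the same number of index levels as the input, which is what groupby needs in order to concatenate the per-group results.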
To resolve this issue, you need to maintain the MultiIndex structure throughout the function. You can achieve this by modifying the keep_latest_info function as follows:
import pandas as pd
import numpy as np

idxslc = pd.IndexSlice

def keep_latest_info(df):
    print('Before:')
    print(df)
    # A single-row group is returned unchanged, with all three index levels.
    if len(df) == 1:
        return df
    # Largest label in the last index level of this group.
    max_dt = df.index.get_level_values(-1).max()
    # The trailing ", :" selects all columns and keeps every index level.
    df = df.loc[idxslc[:, :, max_dt], :]
    print('After:')
    print(df)
    return df

df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
df.index = df.index.rename('a-b-c'.split('-'))
result = df.groupby(level=[0, 1], group_keys=True, as_index=False).apply(keep_latest_info)
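Changes made:
Added , : after idxslc[:, :, max_dt] so that .loc receives an explicit column indexer and keeps all levels of the MultiIndex.
Changed as_index=True to as_index=False in the groupby call, so that the result is labelled by an integer group number instead of having the group keys a and b prepended to its index a second time.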
With these changes, the function will retain the MultiIndex structure throughout the operation, and you should get the expected output without any errors.
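For comparison, here is a minimal sketch of the same idea without the debug prints, using the list form [max_dt] from the question's edit. Passing group_keys=False is my own choice here, not something the question used. I am assuming it is acceptable for the result to keep just the original a, b, c index.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

def keep_latest_info(group):
    """Keep the rows whose last index level equals the group's maximum."""
    max_dt = group.index.get_level_values(-1).max()
    # Wrapping max_dt in a list keeps the third index level in the result,
    # so single-row groups need no special case.
    return group.loc[idx[:, :, [max_dt]], :]

df = pd.DataFrame(np.array(((1, 2, 2), (4, 9, 9), (5, 6, 7), (3, 9, 1)))).T.set_index([0, 1, 2])
df.index = df.index.rename(['a', 'b', 'c'])

# group_keys=False (an assumption, see above) concatenates the group results
# as they are, without prepending the group keys to the index.
result = df.groupby(level=[0, 1], group_keys=False).apply(keep_latest_info)
print(result)
```

With the sample data this should leave one row per (a, b) pair: the row with c == 5 for group (1, 4) and the row with c == 7 for group (2, 9).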