Search code examples
pandassortingdataframemulti-index

Pandas Return top N from each group in MultiIndex


I have a large multi-index dataframe with about 3 million data points. The first index is the date, the second is the name of the object. Then, each object has a value like so:

df = pd.DataFrame({'year': [2012, 2012, 2012, 2013, 2013, 2013],
'item':['apples', 'metals', 'water', 'apples', 'metals', 'water'],
'value': [10, 14, 16, 17, 13, 14]})

df.set_index(['year', 'item'], inplace = True)

This is a smaller example of the dataframe I'm working with. I want to return a dataframe that keeps the exact same format, but only has the top two values from each year.

So the final dataframe should output metals and water from 2012 with their value, and apples and water from 2013 with their respective value, and in the same format as the original dataframe.


Solution

  • IIUC sort_values before groupby tail

    df.sort_values('value').groupby(level=0).tail(2).sort_index()
    Out[258]: 
                 value
    year item         
    2012 metals     14
         water      16
    2013 apples     17
         water      14