python-3.x pandas multi-index percentile

Pandas create percentile field based on groupby with level 1

Given the following data frame:

import pandas as pd    
df = pd.DataFrame({
    ('Group', 'group'): ['a','a','a','b','b','b'],
    ('sum', 'sum'): [234, 234,544,7,332,766]
    })

I'd like to create a new field that holds the percentile of each value of "sum" within its group in "group". The trouble is that the frame has two header levels, and I cannot figure out how to avoid this error:

ValueError: level > 0 only valid with MultiIndex

when I run this:

df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)

I need to keep the headers in the same structure.

Thanks in advance!

Solution

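The level argument of groupby refers to levels of the row index, and here the row index has only one level; the two header levels belong to the columns, hence the ValueError. To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column, select each column by its full tuple key: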

In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))

In [107]: df
Out[107]: 
  Group  sum      rank
  group  sum          
0     a  234  0.833333
1     a  234  0.833333
2     a  544  0.333333
3     b    7  1.000000
4     b  332  0.666667
5     b  766  0.333333
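
Because the frame has two header levels, df['rank'] is stored under the key ('rank', ''), which is why the second header row is blank above. If both levels should be filled in, a tuple key can be used instead. A minimal sketch; the label ('rank', 'rank') is only an illustrative choice, not something from the question:

```python
import pandas as pd

df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766],
})

# Group one column (a Series) by the values of another column.
pct_rank = (df[('sum', 'sum')]
            .groupby(df[('Group', 'group')])
            .rank(pct=True, ascending=False))

# A tuple key fills both header levels (the second label is hypothetical).
df[('rank', 'rank')] = pct_rank
print(df.columns.tolist())
```

The existing columns are untouched, so the header keeps its two-level structure.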

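Note that .rank(pct=True) computes a percentage rank (the rank divided by the group size), not a percentile. To compute a percentile you could use scipy.stats.percentileofscore; subtracting its result from 100 mirrors ascending=False above, so the largest value in each group gets 0: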

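import scipy.stats as stats
# transform keeps the original row index, so the result aligns with df
# (on pandas 2.x, apply would prepend the group keys to the index instead)
df['percentile'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')])
    .transform(lambda ser: 100 - pd.Series(
        [stats.percentileofscore(ser, x, kind='rank') for x in ser],
        index=ser.index)))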

yields

  Group  sum      rank percentile
  group  sum                     
0     a  234  0.833333  50.000000
1     a  234  0.833333  50.000000
2     a  544  0.333333   0.000000
3     b    7  1.000000  66.666667
4     b  332  0.666667  33.333333
5     b  766  0.333333   0.000000
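
For scores that are themselves members of the group, percentileofscore(..., kind='rank') works out to 100 times the ascending average rank divided by the group size, so the same numbers can apparently be obtained with pandas alone. A sketch under that assumption; the column name 'percentile_pd' is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766],
})

# Ascending percentage rank within each group (ties share the average rank).
asc = (df[('sum', 'sum')]
       .groupby(df[('Group', 'group')])
       .rank(pct=True, ascending=True))

# Flip the scale so the largest value in each group maps to 0.
df['percentile_pd'] = 100 * (1 - asc)
print(df)
```

This avoids the per-element Python loop, which may matter for large groups, but it only matches scipy for values that occur in the group being scored.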