Tags: python-3.x, pandas, multi-index, percentile

Pandas create percentile field based on groupby with level 1


Given the following data frame:

import pandas as pd    
df = pd.DataFrame({
    ('Group', 'group'): ['a','a','a','b','b','b'],
    ('sum', 'sum'): [234, 234,544,7,332,766]
    })

I'd like to create a new field that holds the percentile of each value of "sum" within its group in "group". The trouble is that the frame has two header rows (the columns are a MultiIndex), and I cannot figure out how to avoid this error:

ValueError: level > 0 only valid with MultiIndex

when I run this:

df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)

I need to keep the headers in the same structure.

Thanks in advance!


Solution

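  • The level argument of groupby refers to levels of the row index, not to the column headers, so level=1 fails on a frame whose row index has only one level. To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column, select both columns by their full tuple keys: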

    In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
    
    In [107]: df
    Out[107]: 
      Group  sum      rank
      group  sum          
    0     a  234  0.833333
    1     a  234  0.833333
    2     a  544  0.333333
    3     b    7  1.000000
    4     b  332  0.666667
    5     b  766  0.333333
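
As a self-contained sketch of the step above, the snippet below rebuilds the question's frame and stores the rank under a full tuple key. The label ('rank', 'rank') is a hypothetical choice, not something from the question; it is used here only to show that both header levels of the new column can be named explicitly.

```python
import pandas as pd

# Rebuild the frame from the question; tuple keys produce two header rows.
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766],
})

# Select each column by its full (level 0, level 1) tuple.
values = df[('sum', 'sum')]
groups = df[('Group', 'group')]

# Hypothetical label: a tuple key names both header levels of the new column.
df[('rank', 'rank')] = values.groupby(groups).rank(pct=True, ascending=False)

print(df.columns.nlevels)  # number of header rows after the assignment
print(df)
```

Assigning with a plain string such as df['rank'], as in the session above, also works; the tuple form is only useful when the second header row should not be left blank.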
    

    Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.

    import scipy.stats as stats
    # group_keys=False keeps the original row index so the result aligns with df
    df['percentile'] = (df[('sum', 'sum')]
        .groupby(df[('Group', 'group')], group_keys=False)
        .apply(lambda ser: 100 - pd.Series(
            [stats.percentileofscore(ser, x, kind='rank') for x in ser],
            index=ser.index)))
    

    yields

      Group  sum      rank percentile
      group  sum                     
    0     a  234  0.833333  50.000000
    1     a  234  0.833333  50.000000
    2     a  544  0.333333   0.000000
    3     b    7  1.000000  66.666667
    4     b  332  0.666667  33.333333
    5     b  766  0.333333   0.000000
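
The relationship between the two columns can be checked without scipy. The sketch below assumes that kind='rank' corresponds to the average ascending rank divided by the group size; on this sample data that assumption reproduces the percentile column shown above, but it has not been checked against other inputs (for example, data with missing values).

```python
import pandas as pd

df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766],
})

values = df[('sum', 'sum')]
groups = df[('Group', 'group')]

# Ascending percentage rank within each group; ties share their average rank.
asc_rank = values.groupby(groups).rank(pct=True, ascending=True)

# Flip the scale so the largest value in each group maps to 0.
percentile = 100 - 100 * asc_rank

# Descending percentage rank, as in the 'rank' column, for comparison.
desc_rank = values.groupby(groups).rank(pct=True, ascending=False)

print(pd.DataFrame({'percentile': percentile, 'rank_pct': 100 * desc_rank}))
```

In this example the descending percentage rank times 100 exceeds the percentile by 100 / n in every row, where n is the group size (3 here), which is one way to see why the two columns differ.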