Search code examples
pythoncountgroupingpercentile

Aggregate by percentile and count for groups in python


I'm a new python user familiar with R.

I want to calculate user-defined quantiles for groups complete with the count of observations in each group.

In R I would do:

df_sum <- df %>% group_by(group) %>% 
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())

In python I can get the grouped percentile by:

df_sum = df.groupby(['group'])['obsval'].quantile(0.85)

How do I add the group count to this?

I have tried:

df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])

Example data:

data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df

Expected result:

group  percentile count
A       7.4        5
B       6.55       4

Solution

  • You can use pandas.DataFrame.agg() to apply multiple functions.
    In this case you should use numpy.quantile().

    import pandas as pd
    import numpy as np
    
    data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
    df = pd.DataFrame(data)
    df_sum = df.groupby(['group'])['obsval'].agg([lambda x : np.quantile(x, q=0.85), "count"])
    df_sum.columns = ['percentile', 'count']
    print(df_sum)