Search code examples
pythonpandaspandas-groupbysplit-apply-combine

Quantile threshold/filter within pandas groupby


I have one categorical variable and two numeric cols:

np.random.seed(123)    
df = pd.DataFrame({'group' : ['a']*10+['b']*10,
                   'var1' : np.random.randn(20),
                   'var2' : np.random.randint(10,size=20)})

I want to find, by group, the mean of var1 after filtering to constrict df to the top-quartile var2 values by group. That is, the threshold for each group would be:

thresh = df.groupby('group')['var2'].quantile(0.75)

Here's a loopy demonstration of what I want to end up with:

for group, frame in df.groupby('group'):
    print(frame[frame.var2 >= frame.var2.quantile(0.75)].var1.mean())
# -1.4713362407192072
# 0.15512098976530683

The result should be a DataFrame with (group, var) columns or a Series indexed by group. I believe the solution may involve .transform/.apply but am stuck there.


Solution

  • Similar to @Bharathshetty's answer but making the function a little more flexible rather than incorporating it to a lambda:

    def thresh_filter(obj, sort_var, tgt_var, q=0.75, stat='mean'):
        thresh = obj[sort_var].quantile(q=q)
        return getattr(obj[obj[sort_var] >= thresh][tgt_var], stat)()
    
    print(df.groupby('group').apply(lambda x: thresh_filter(x, 'var2', 'var1')))
    group
    a   -1.47134
    b    0.15512
    dtype: float64