I have one categorical variable and two numeric cols:
np.random.seed(123)
df = pd.DataFrame({'group' : ['a']*10+['b']*10,
'var1' : np.random.randn(20),
'var2' : np.random.randint(10,size=20)})
I want to find, by group
, the mean of var1
after filtering to constrict df
to the top-quartile var2
values by group. That is, the threshold for each group would be:
thresh = df.groupby('group')['var2'].quantile(0.75)
Here's a loopy demonstration of what I want to end up with:
for group, frame in df.groupby('group'):
print(frame[frame.var2 >= frame.var2.quantile(0.75)].var1.mean())
# -1.4713362407192072
# 0.15512098976530683
The result should be a DataFrame with (group
, var
) columns or a Series indexed by group
. I believe the solution may involve .transform
/.apply
but am stuck there.
Similar to @Bharathshetty's answer but making the function a little more flexible rather than incorporating it to a lambda:
def thresh_filter(obj, sort_var, tgt_var, q=0.75, stat='mean'):
thresh = obj[sort_var].quantile(q=q)
return getattr(obj[obj[sort_var] >= thresh][tgt_var], stat)()
print(df.groupby('group').apply(lambda x: thresh_filter(x, 'var2', 'var1')))
group
a -1.47134
b 0.15512
dtype: float64