I'm having trouble even wording the question, honestly, but it goes roughly like this: I need a groupby-like operation that applies an aggregate function not to the rows equal to each value in the grouping column, but to all rows where the grouping column is strictly greater than that value.
I'll be purposefully verbose, just in case. Let's say I have this dataframe:
import pandas as pd
df = pd.DataFrame({
    'Weight': [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    'Height': [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180]
})
On which I perform a simple groupby to find a mean value:
gb = df.groupby(['Weight'])['Height'].mean().reset_index()
gb
Which returns
Weight Height
0 40 180
1 50 185
2 60 170
3 70 170
4 80 150
5 100 180
But I need something like this (not real code obviously):
gb = df.groupby(["""a dynamic column or a mask of some sort that marks all rows where 'Weight' is greater than each of its unique values"""])['Height'].mean().reset_index()
gb
I can, of course, achieve the desired result by iterating over each unique value in the column like this:
res_list = []
for w in sorted(df['Weight'].unique().tolist()):
    res_list.append(df[df['Weight'] > w]['Height'].mean())
gb = pd.DataFrame({'Weight': sorted(df['Weight'].unique().tolist()), 'Height': res_list})
gb
Which will return exactly what I want:
Weight Height
0 40 172.00
1 50 168.75
2 60 167.50
3 70 165.00
4 80 180.00
5 100 NaN
But this method scales very poorly with the number of unique values and the number of columns I need to perform this operation on.
I'm not married to groupby() specifically, but I have a feeling that there's a way to do it, and I just lack the googling skills to find the answer.
Mean is just sum divided by count. Both sum and count can be accumulated in your case:
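The core trick is that "sum over everything strictly greater" is a suffix sum, which a reversed cumulative sum gives you. A minimal sketch on a toy Series (the data here is illustrative, with the index playing the role of the Weight classes):

```python
import pandas as pd

# Toy Series: index stands in for the grouping values, data for the per-group sums
s = pd.Series([1, 2, 3, 4], index=[10, 20, 30, 40])

# Reverse so the largest index comes first, take a running total, then
# subtract each row itself so every position holds only the sum of the
# entries strictly above it.
rev = s[::-1]
suffix = (rev.cumsum() - rev)[::-1]
# suffix is [9, 7, 4, 0] for indices [10, 20, 30, 40]
```

Applied to per-group sums and counts at once, this yields the mean over all heavier classes in a single vectorized pass, which is what the code below does.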
import numpy as np
import pandas as pd

# Let's add another column for fun
df = pd.DataFrame(
    {
        "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
        "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
        "Age": np.random.randint(20, 100, 13),
    }
)
# For each Weight class, sum up Height and Age, and count how many rows
# belong to that class. The [::-1] reverses the result so the Weights are
# sorted from highest to lowest
tmp = (
    df.groupby("Weight")[["Height", "Age"]]
    .agg(["sum", "count"])
    .swaplevel(axis=1)[::-1]
)
# Take the sum of all rows with Weight higher than the current Weight
tmp = tmp.cumsum() - tmp
# The mean is just the sum divided by count
(tmp["sum"] / tmp["count"])[::-1]
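For completeness, the cumulative approach can be checked against the brute-force loop from the question. A self-contained sketch, restricted to the Height column so no random data is needed:

```python
import math

import pandas as pd

df = pd.DataFrame({
    "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
})

# Cumulative approach: per-class sum and count, reversed so heaviest comes
# first, then cumsum-minus-self leaves the totals over strictly heavier classes
tmp = df.groupby("Weight")[["Height"]].agg(["sum", "count"]).swaplevel(axis=1)[::-1]
tmp = tmp.cumsum() - tmp
fast = (tmp["sum"] / tmp["count"])[::-1]["Height"]

# Brute-force loop for comparison; the heaviest class has no rows above it,
# so both sides produce NaN there
for w in sorted(df["Weight"].unique()):
    slow = df.loc[df["Weight"] > w, "Height"].mean()
    assert slow == fast[w] or (math.isnan(slow) and math.isnan(fast[w]))
```

Unlike the loop, which filters the whole frame once per unique value, the cumulative version does a single groupby and a single cumulative pass, so it scales with the number of rows rather than rows times unique values.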