Tags: python, pandas, group-by

Pandas groupby all rows greater than _each_ value in column


I'm having trouble even wording the question, honestly, but it goes roughly like this: I need a groupby that applies an aggregate function not to the rows *equal to* each value in the grouping column, but to all rows where the grouping column is *greater than* each of its unique values.

I'll be purposefully verbose, just in case. Let's say I have this dataframe:

import pandas as pd
df = pd.DataFrame({
    'Weight': [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    'Height': [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180]
    })

On which I perform a simple groupby to find a mean value:

gb = df.groupby(['Weight'])['Height'].mean().reset_index()
gb

Which returns

   Weight  Height
0      40     180
1      50     185
2      60     170
3      70     170
4      80     150
5     100     180

But I need something like this (not real code obviously):

gb = df.groupby(["""a dynamic column or a mask of some sort that marks all rows where 'Weight' is greater than each of its unique values"""])['Height'].mean().reset_index()
gb

I can, of course, achieve the desired result by iterating over each unique value in the column like this:

res_list = []
for w in sorted(df['Weight'].unique().tolist()):
    res_list.append(df[df['Weight'] > w]['Height'].mean())
gb = pd.DataFrame({'Weight': sorted(df['Weight'].unique().tolist()), 'Height': res_list})
gb

Which will return exactly what I want:

   Weight  Height
0      40  172.00
1      50  168.75
2      60  167.50
3      70  165.00
4      80  180.00
5     100     NaN

But this method scales very poorly with the number of unique values and the number of columns I need to perform this operation on.

I'm not married to groupby() specifically, but I have a feeling that there's a way to do it, and I just lack the googling skills to find the answer.


Solution

  • Mean is just sum divided by count. Both sum and count can be accumulated in your case:

    import numpy as np
    import pandas as pd

    # Let's add another column for fun
    df = pd.DataFrame(
        {
            "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
            "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
            "Age": np.random.randint(20, 100, 13),
        }
    )
    
    # for each Weight class, sum up the Height and Age, and count how many rows
    # belong to that class. The [::-1] reverses the frame so Weight runs from
    # highest to lowest
    tmp = (
        df.groupby("Weight")[["Height", "Age"]]
        .agg(["sum", "count"])
        .swaplevel(axis=1)[::-1]
    )
    
    # Take the sum of all rows with Weight higher than the current Weight
    tmp = tmp.cumsum() - tmp
    
    # The mean is just the sum divided by count
    (tmp["sum"] / tmp["count"])[::-1]
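To see that the accumulate-then-divide trick reproduces the loop from the question, here is a minimal self-contained sketch (Height column only, without the extra Age column) that computes the result both ways:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Weight": [40, 50, 60, 70, 40, 60, 80, 100, 60, 40, 50, 70, 60],
    "Height": [150, 160, 170, 180, 190, 160, 150, 180, 170, 200, 210, 160, 180],
})

# per-Weight sums and counts, reversed so Weight runs highest to lowest
tmp = df.groupby("Weight")[["Height"]].agg(["sum", "count"]).swaplevel(axis=1)[::-1]
# shifted cumulative totals: everything strictly above the current Weight
tmp = tmp.cumsum() - tmp
vectorized = (tmp["sum"] / tmp["count"])[::-1]["Height"]

# loop-based reference from the question
weights = sorted(df["Weight"].unique())
looped = pd.Series(
    [df.loc[df["Weight"] > w, "Height"].mean() for w in weights], index=weights
)

print(vectorized.tolist())  # [172.0, 168.75, 167.5, 165.0, 180.0, nan]
```

The NaN for Weight 100 falls out naturally as a 0/0 division, matching the loop's mean over an empty selection.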