Search code examples
pythonpandasdataframelambdapandas-groupby

Phyton: How to get the average of the n largest values for each column grouped by id


I'm trying to get the mean for each column while grouped by id. But I don't get it to work as I want to.

The data:

ID       Property3   Property2   Property3
1        10.2        ...         ...
1        20.1
1        51.9
1        15.8
1        12.5
...
1203     104.4
1203     11.5
1203     19.4
1203     23.1

What I got so far:
I got those two tries. But they are both just for one column and I don't know how to do it for more then just one.:

data.groupby('id')['property1'].apply(lambda grp: grp.nlargest(100).mean())
1       37.897989
2       33.059432
3       34.926530
4       33.036137

data.groupby('id').agg({'property1': {lambda grp: grp.nlargest(100).mean()}})
id  property1 <lambda>
1   37.897989
2   33.059432
3   34.926530
4   33.036137

What I want:
Idealy, I would like to have a dataframe as follows:

ID       Property3   Property2   Property3
1        37.8        5.6         2.3
2        33.0        1.5         10.4
3        34.9        91.5        10.3
4        33.0        10.3        14.3

So that each row contains the mean values for the 100 biggest values for EACH column grouped by id.


Solution

  • Use GroupBy.agg with omit columns for processing all columns in DataFrame without ID:

    df = data.groupby('ID').agg(lambda grp: grp.nlargest(100).mean()).reset_index()
    
    print (df)
           ID  Property1  Property2  Property3
    0     1.0       22.1       ...       ...
    1  1203.0       39.6       ...       ...
    

    Or specify columns after groupby:

    df = (data.groupby('ID')['Property1','Property2','Property3']
              .agg(lambda grp: grp.nlargest(100).mean())
              .reset_index())