Pandas idxmin equivalent for mean


I am trying to filter a very large dataframe that looks like this:

unique id  x  y
1          1  2
1          2  3
1          3  4
2          1  2
2          2  3
2          3  4

so that it contains only the row with the mean value for each unique id (e.g. the mean of 'x'), like this:

unique id  x  y
1          2  3
2          2  3

I have tried to filter the data by doing this:

filtered_series = df.groupby("uniqueID")[some_column].mean()

Then the question is how I can filter the original dataframe based on the series above.

The original dataframe contains many rows (the same id can appear many times), and I want to reduce it to only one row per unique id.

I have tried many things, including doing an inner join like this:

df.merge(filtered_series, how="inner", on=["uniqueID", some_column])

Strangely, this yielded even more rows in my df instead of filtering it.

I managed to quickly do the same task for finding the min/max value, easily achieved by the following code:

new_df = df.loc[df.groupby("uniqueID")[some_column].idxmin()]  # or .idxmax() for the maximum

Obviously there is no idxmean function, but perhaps there is a convenient way to achieve the same result. Thank you for your help!


Solution

  • IIUC, you want the row in each group whose value is closest to the group mean. You can create a new column with the absolute difference from the mean, use idxmin on it to get the row labels, then use those labels to filter the original dataframe:

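    # Broadcast each group's mean of "x" back onto its rows
    mean = df.groupby("unique id")["x"].transform("mean")
    # Absolute distance of every row's "x" from its group mean
    df["diff_from_mean"] = (df["x"] - mean).abs()
    # Index label of the closest row in each group (first one on ties)
    idx_mean = df.groupby("unique id")["diff_from_mean"].idxmin()
    # Keep only those rows and drop the helper column
    df = df.loc[idx_mean].drop(columns="diff_from_mean")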

  Output:

       unique id  x  y
    1          1  2  3
    4          2  2  3
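
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea that does not add a helper column to the original dataframe. The function name `rows_closest_to_group_mean` is made up for illustration, the sample data is the small table from the question, and the sketch assumes the dataframe has a unique index (otherwise `.loc` may return more than one row per group).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "unique id": [1, 1, 1, 2, 2, 2],
        "x": [1, 2, 3, 1, 2, 3],
        "y": [2, 3, 4, 2, 3, 4],
    }
)


def rows_closest_to_group_mean(frame, group_col, value_col):
    """Hypothetical helper: one row per group, closest to the group mean."""
    # Group mean of value_col, aligned to the rows of the original frame
    group_mean = frame.groupby(group_col)[value_col].transform("mean")
    # Absolute distance of each row from its group mean (kept as a Series,
    # so the original frame is not modified)
    distance = (frame[value_col] - group_mean).abs()
    # Index label of the smallest distance per group (first one on ties)
    closest_idx = distance.groupby(frame[group_col]).idxmin()
    return frame.loc[closest_idx]


result = rows_closest_to_group_mean(df, "unique id", "x")
print(result)
```

Note that when no row equals the mean exactly, this returns the nearest row rather than a synthetic "mean row", which seems to be what the question is after.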