Search code examples
pythonpandasgroupingpercentile

Replace column values based on percentiles in python


I have grouped a dataframe and I would like, per group, to replace the values in some columns if they or smaller than some percentile. So all values within a group that are larger than the 0.95 percentile should be replaced by the 0.95 percentile and all the values that are smaller than the 0.05 percentile should be replaced by the 0.05 percentile.
The dataframe could look like this(example taken from another question): Two groups: ‘one’ and ‘two’

    A           B           C
0   0.719391    0.091693    one
1   0.951499    0.83716     one
2   0.975212    0.224855    one
3   0.80762     0.031284    one
4   0.63319     0.342889    one
5   0.075102    0.899291    one
6   0.502843    0.773424    one
7   0.032285    0.242476    one
8   0.794938    0.607745    one
9   0.620387    0.574222    one
10  0.446639    0.549749    two
11  0.664324    0.134041    two
12  0.622217    0.505057    two
13  0.670338    0.99087     two
14  0.281431    0.016245    two
15  0.675756    0.185967    two
16  0.145147    0.045686    two
17  0.404413    0.191482    two
18  0.94913     0.943509    two
19  0.164642    0.157013    two

The result for this dataframe should be:

A               B           C
0   0.719391    0.091693    one
1   0.951499    0.83716     one
2   0.96454115  0.224855    one
3   0.80762     0.05846805  one
4   0.63319     0.342889    one
5   0.075102    0.87133205  one
6   0.502843    0.773424    one
7   0.05155265  0.242476    one
8   0.794938    0.607745    one
9   0.620387    0.574222    one
10  0.446639    0.549749    two
11  0.664324    0.134041    two
12  0.622217    0.505057    two
13  0.670338    0.96955755  two
14  0.281431    0.02949345  two
15  0.675756    0.185967    two
16  0.15391975  0.045686    two
17  0.404413    0.191482    two
18  0.8261117   0.943509    two
19  0.164642    0.157013    two

Notice that for column A rows 2, 7, 16 and 18 have been replaced; and for column B rows 3, 5, 13, and 14 have been replaced.

Does anyone knows how to do this in an efficient way for a large dataframe?

Thanks


Solution

  • You could use groupby + quantile + clip:

    g = df.groupby('C').transform(lambda x: x.clip(*x.quantile([0.05, 0.95])))
    g['C'] = df['C']
    
               A         B    C
    0   0.719391  0.091693  one
    1   0.951499  0.837160  one
    2   0.964541  0.224855  one
    3   0.807620  0.058468  one
    4   0.633190  0.342889  one
    5   0.075102  0.871332  one
    6   0.502843  0.773424  one
    7   0.051553  0.242476  one
    8   0.794938  0.607745  one
    9   0.620387  0.574222  one
    10  0.446639  0.549749  two
    11  0.664324  0.134041  two
    12  0.622217  0.505057  two
    13  0.670338  0.969558  two
    14  0.281431  0.029493  two
    15  0.675756  0.185967  two
    16  0.153920  0.045686  two
    17  0.404413  0.191482  two
    18  0.826112  0.943509  two
    19  0.164642  0.157013  two
    

    Sanity Check

    np.allclose(e[['A', 'B']].values, g[['A', 'B']].values)
    True
    

    Here, e is the output in your question.