I have grouped a dataframe and I would like, per group, to replace the values in some columns if they or smaller than some percentile.
So all values within a group that are larger than the 0.95 percentile should be replaced by the 0.95 percentile and all the values that are smaller than the 0.05 percentile should be replaced by the 0.05 percentile.
The dataframe could look like this(example taken from another question):
Two groups: ‘one’ and ‘two’
A B C
0 0.719391 0.091693 one
1 0.951499 0.83716 one
2 0.975212 0.224855 one
3 0.80762 0.031284 one
4 0.63319 0.342889 one
5 0.075102 0.899291 one
6 0.502843 0.773424 one
7 0.032285 0.242476 one
8 0.794938 0.607745 one
9 0.620387 0.574222 one
10 0.446639 0.549749 two
11 0.664324 0.134041 two
12 0.622217 0.505057 two
13 0.670338 0.99087 two
14 0.281431 0.016245 two
15 0.675756 0.185967 two
16 0.145147 0.045686 two
17 0.404413 0.191482 two
18 0.94913 0.943509 two
19 0.164642 0.157013 two
The result for this dataframe should be:
A B C
0 0.719391 0.091693 one
1 0.951499 0.83716 one
2 0.96454115 0.224855 one
3 0.80762 0.05846805 one
4 0.63319 0.342889 one
5 0.075102 0.87133205 one
6 0.502843 0.773424 one
7 0.05155265 0.242476 one
8 0.794938 0.607745 one
9 0.620387 0.574222 one
10 0.446639 0.549749 two
11 0.664324 0.134041 two
12 0.622217 0.505057 two
13 0.670338 0.96955755 two
14 0.281431 0.02949345 two
15 0.675756 0.185967 two
16 0.15391975 0.045686 two
17 0.404413 0.191482 two
18 0.8261117 0.943509 two
19 0.164642 0.157013 two
Notice that for column A rows 2, 7, 16 and 18 have been replaced; and for column B rows 3, 5, 13, and 14 have been replaced.
Does anyone knows how to do this in an efficient way for a large dataframe?
Thanks
You could use groupby
+ quantile
+ clip
:
g = df.groupby('C').transform(lambda x: x.clip(*x.quantile([0.05, 0.95])))
g['C'] = df['C']
A B C
0 0.719391 0.091693 one
1 0.951499 0.837160 one
2 0.964541 0.224855 one
3 0.807620 0.058468 one
4 0.633190 0.342889 one
5 0.075102 0.871332 one
6 0.502843 0.773424 one
7 0.051553 0.242476 one
8 0.794938 0.607745 one
9 0.620387 0.574222 one
10 0.446639 0.549749 two
11 0.664324 0.134041 two
12 0.622217 0.505057 two
13 0.670338 0.969558 two
14 0.281431 0.029493 two
15 0.675756 0.185967 two
16 0.153920 0.045686 two
17 0.404413 0.191482 two
18 0.826112 0.943509 two
19 0.164642 0.157013 two
Sanity Check
np.allclose(e[['A', 'B']].values, g[['A', 'B']].values)
True
Here, e
is the output in your question.