I have a dataframe with a numeric column and I would link to calculate the percentile of the values in each row for that column considering only previous rows of the column. Here is an example:
+-------+
| col_1 |
+-------+
| 5 |
+-------+
| 4 |
+-------+
| 10 |
+-------+
| 1 |
+-------+
| 7 |
+-------+
I would like to obtain a dataframe like this:
+-------+------------+
| col_1 | percentile |
+-------+------------+
| 5 | 100 |
+-------+------------+
| 4 | 50 |
+-------+------------+
| 10 | 100 |
+-------+------------+
| 1 | 25 |
+-------+------------+
| 7 | 80 |
+-------+------------+
How can I calculate it?
Try as follows.
df.expanding
with min_periods=1
to allow expanding window calculations.Expanding.rank
with pct=True
(and we multiply by 100
).We can assign the result directly to the new column percentile
:
import pandas as pd
data = {'col_1':[5,4,10,1,7]}
df = pd.DataFrame(data)
df['percentile'] = df['col_1'].expanding(min_periods=1).rank(pct=True).mul(100)
print(df)
col_1 percentile
0 5 100.0
1 4 50.0
2 10 100.0
3 1 25.0
4 7 80.0
Update: Expanding.rank
was added to pandas
in version 1.4.0
. For earlier versions, you could for instance try:
temp = df['col_1'].expanding(min_periods=1).agg(['rank','count'])
df['percentile'] = (temp['rank']/temp['count']).mul(100)
print(df)
col_1 percentile
0 5 100.0
1 4 50.0
2 10 100.0
3 1 25.0
4 7 80.0
Or, as a one-liner:
df['percentile'] = df['col_1'].expanding(min_periods=1)\
.apply(lambda x: (x.rank()/x.count()).to_numpy()[-1]*100)