How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?
I know how to calculate aggregated group-level statistics using the groupby()
method and I also know how to rank using multiple columns without groups (see here, here and here). The main problem seems to be getting both ideas (grouping & ranking) to play nicely together.
This other thread has some ideas on how to solve the problem, but its results don't show you which rows are tied - it just returns an array of ever-increasing ranks even when the values are identical. The problem is described in more detail in the example I created below.
import pandas as pd
df = pd.DataFrame({'row_id':[1,2,3,4,5,6,7,8,9,10],
'Group':[1,1,1,1,1,2,2,2,2,2],
'Var1':[100,100,100,200,200,300,300,400,400,400],
'Var2':[5,5,6,7,8,1,1,2,2,3]})
print(df)
# row_id Group Var1 Var2
# 0 1 1 100 5
# 1 2 1 100 5
# 2 3 1 100 6
# 3 4 1 200 7
# 4 5 1 200 8
# 5 6 2 300 1
# 6 7 2 300 1
# 7 8 2 400 2
# 8 9 2 400 2
# 9 10 2 400 3
In the case above, I would like to group using the Group
variable and rank using the Var1
and Var2
variables. Therefore, I expect the output to look like this:
# row_id Group Var1 Var2 Rank
# 0 1 1 100 5 1
# 1 2 1 100 5 1
# 2 3 1 100 6 3
# 3 4 1 200 7 4
# 4 5 1 200 8 5
# 5 6 2 300 1 1
# 6 7 2 300 1 1
# 7 8 2 400 2 3
# 8 9 2 400 2 3
# 9 10 2 400 3 5
Using the data in the example above, if I would like to group using the Group
variable and only rank based on the Var1
column, that would be pretty easy:
df['Rank_Only_Var1'] = df.groupby('Group')['Var1'].rank(method='min', ascending=True)
print(df)
# row_id Group Var1 Var2 Rank_Only_Var1
# 0 1 1 100 5 1.0
# 1 2 1 100 5 1.0
# 2 3 1 100 6 1.0
# 3 4 1 200 7 4.0
# 4 5 1 200 8 4.0
# 5 6 2 300 1 1.0
# 6 7 2 300 1 1.0
# 7 8 2 400 2 3.0
# 8 9 2 400 2 3.0
# 9 10 2 400 3 3.0
However, if I want to group using the Group
variable and rank using the Var1
and Var2
variables, things get hairy. Using the approach suggested by this other post, we arrive at the following results:
df = df.sort_values(['Var1', 'Var1'], ascending=[True, True])
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['Group'])['overall_rank'].cumsum()
print(df)
# row_id Group Var1 Var2 overall_rank
# 0 1 1 100 5 1
# 1 2 1 100 5 2
# 2 3 1 100 6 3
# 3 4 1 200 7 4
# 4 5 1 200 8 5
# 5 6 2 300 1 1
# 6 7 2 300 1 2
# 7 8 2 400 2 3
# 8 9 2 400 2 4
# 9 10 2 400 3 5
Note how the first and second rows have identical values for Var1
and Var2
, but the first row is ranked 1 and the second row is ranked 2. Those two rows shouldn't have different ranks. Their ranks should be identical and tied, because the values the rank is based on are identical and tied. This problem also happens with rows 6 & 7 as well as with rows 8 & 9.
I even tried adapting the solution from this answer, but it doesn't work when we have a groupby
statement.
How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?
It's not clear why you can't use the linked solution with a .groupby
You can also replace the .apply
for better performance:
(df.assign(key = list(map(tuple, df[['Var1', 'Var2']].values)))
.groupby('Group')['key'].rank('dense'))
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 1.0
6 1.0
7 2.0
8 2.0
9 3.0
Name: key, dtype: float64