My data is like and I want to calculate the NDCG
metric
pid query title label predict prob
0 1 A a 0 2 0.9
1 1 A b 2 0 0.8
2 1 A c 1 2 0.7
3 2 D d 3 1 0.8
4 2 D e 4 2 0.9
After groupby
opration
each_pid = data.groupby(data['pid']).apply(lambda g: g.sort_values(['predict', 'prob'], ascending=False))
I got a dataframe like this.
pid query title label predict prob score
pid
1 0 1 A a 0 2 0.9 6
2 1 A c 1 2 0.7 18
1 1 A b 2 0 0.8 42
2 4 2 D e 4 2 0.9 186
3 2 D d 3 1 0.8 90
Now I want another column names weight
which's value is according to the order of each group as following.
pid query title label predict prob score weight
pid
1 0 1 A a 0 2 0.9 6 1
2 1 A c 1 2 0.7 18 2
1 1 A b 2 0 0.8 42 3
2 4 2 D e 4 2 0.9 186 1
3 2 D d 3 1 0.8 90 2
And how to get top2 rows of each group like
pid query title label predict prob score weight
pid
1 0 1 A a 0 2 0.9 6 1
2 1 A c 1 2 0.7 18 2
2 4 2 D e 4 2 0.9 186 1
3 2 D d 3 1 0.8 90 2
Can anyone help?
EDIT. Thanks for @Akshay Kandul and @Allen help. The code works as
data = data.groupby(level=0).head(2)
data['weight'] = data.groupby(level=0).cumcount()+1
You can use cumcount function on group by of index level zero
each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()
output
label pid predict prob query title weight
pid
1 0 0 1 2 0.9 A a 0
2 1 1 2 0.7 A c 1
1 2 1 0 0.8 A b 2
2 4 4 2 2 0.9 D e 0
3 3 2 1 0.8 D d 1
If your really want it to start with 1 then just do this instead:
each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()+1
output
label pid predict prob query title weight
pid
1 0 0 1 2 0.9 A a 1
2 1 1 2 0.7 A c 2
1 2 1 0 0.8 A b 3
2 4 4 2 2 0.9 D e 1
3 3 2 1 0.8 D d 2
If you need top 2 records of each group you can use this code:
print(each_pid.groupby(level=0).head(2))