Search code examples
pythonpandasdataframerunning-count

How to add one column according to the order after `groupby` in pandas?


My data is like and I want to calculate the NDCG metric

  pid query title  label  predict  prob
0   1     A     a      0        2   0.9
1   1     A     b      2        0   0.8
2   1     A     c      1        2   0.7
3   2     D     d      3        1   0.8
4   2     D     e      4        2   0.9

After groupby opration

each_pid = data.groupby(data['pid']).apply(lambda g: g.sort_values(['predict', 'prob'], ascending=False))

I got a dataframe like this.

      pid query title  label  predict  prob  score
pid                                               
1   0   1     A     a      0        2   0.9      6
    2   1     A     c      1        2   0.7     18
    1   1     A     b      2        0   0.8     42
2   4   2     D     e      4        2   0.9    186
    3   2     D     d      3        1   0.8     90

Now I want another column names weight which's value is according to the order of each group as following.

      pid query title  label  predict  prob  score  weight
pid                                               
1   0   1     A     a      0        2   0.9      6    1
    2   1     A     c      1        2   0.7     18    2
    1   1     A     b      2        0   0.8     42    3
2   4   2     D     e      4        2   0.9    186    1
    3   2     D     d      3        1   0.8     90    2

And how to get top2 rows of each group like

      pid query title  label  predict  prob  score  weight
pid                                               
1   0   1     A     a      0        2   0.9      6    1
    2   1     A     c      1        2   0.7     18    2
2   4   2     D     e      4        2   0.9    186    1
    3   2     D     d      3        1   0.8     90    2

Can anyone help?

EDIT. Thanks for @Akshay Kandul and @Allen help. The code works as

data = data.groupby(level=0).head(2)
data['weight'] = data.groupby(level=0).cumcount()+1

Solution

  • You can use cumcount function on group by of index level zero

    each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()
    

    output

            label  pid  predict  prob query title  weight
    pid                                                 
    1   0      0    1        2   0.9     A     a       0
        2      1    1        2   0.7     A     c       1
        1      2    1        0   0.8     A     b       2
    2   4      4    2        2   0.9     D     e       0
        3      3    2        1   0.8     D     d       1
    

    If your really want it to start with 1 then just do this instead:

    each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()+1
    

    output

            label  pid  predict  prob query title  weight
    pid                                                 
    1   0      0    1        2   0.9     A     a       1
        2      1    1        2   0.7     A     c       2
        1      2    1        0   0.8     A     b       3
    2   4      4    2        2   0.9     D     e       1
        3      3    2        1   0.8     D     d       2
    

    If you need top 2 records of each group you can use this code:

    print(each_pid.groupby(level=0).head(2))