Search code examples
pythonpandasdataframeanalysis

How to create order in pandas dataframe groups?


I have data dataframe where in which I want to add column which will indicate the order of the of the date for each user_id separately, so it will be like in TABLE2.

TABLE1:

user_id         received_at          action    
0043e1a6-52e4   2018-01-05 12:32:10  A                 
                2018-01-05 12:33:13  A                 
                2018-01-05 12:42:12  B                 
0070f782-29f4   2018-01-06 01:41:18  A                 
                2018-01-06 01:42:12  A                 
                2018-01-06 01:43:11  B                 
                2018-01-06 01:44:18  C                               
008aa58a-84a5   2018-01-06 14:22:13  A                 
                2018-01-06 14:23:18  A                 
                2018-01-06 14:24:13  A                 
                2018-01-06 14:25:18  C 

to something like this

TABLE2:

user_id         received_at          action  order_n 
0043e1a6-52e4   2018-01-05 12:32:10  A       1               
                2018-01-05 12:33:13  A       2               
                2018-01-05 12:42:12  B       3               
0070f782-29f4   2018-01-06 01:41:18  A       1               
                2018-01-06 01:42:12  A       2               
                2018-01-06 01:43:11  B       3               
                2018-01-06 01:44:18  C       4                             
008aa58a-84a5   2018-01-06 14:22:13  A       1               
                2018-01-06 14:23:18  A       2               
                2018-01-06 14:24:13  A       3               
                2018-01-06 14:25:18  C       4  

Are there are more efficient and maybe simpler solution than doing iteration for each user_id separately?


Solution

  • groupby user_id and get the rank using received_at

    df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank)
    

    This doesn't require a sorting step & will assign the correct rank even if the data frame is not sorted by received_at within each group

    if the column user_id is set as an index (as your sample data seems to indicate), you could alternative use the following instead. Although, in recent versions of pandas, grouping by named indexes also works (i.e. the above might work)

    df.groupby(level=0).received_at.apply(pd.Series.rank)