Search code examples
pythonpandasdataframedatedatetime

How to create new column in Data Frame to sort values for each id based on date column in Python Pandas?


I have Data Frame in Python Pandas like below:

  • Column my_date is in datetime format.

  • In my real DataFrame I have many more columns.

    id my_date col1
    111 2023-05-15 1
    111 2023-05-14 11
    111 2023-05-13 2
    222 2023-10-11 3
    222 2023-10-12 55

And I need to create new column col_x where for each id will be value 1,2,3 and so on based on date in column my_date. So, for each id in column col_x has to be 1 in the early date from my_date and so on.

So, as a result I need something like below:

id my_date col1 col_x
111 2023-05-15 1 3
111 2023-05-14 11 2
111 2023-05-13 2 1
222 2023-10-11 3 1
222 2023-12-12 55 2

How can I do that in Python Pandas ?


Solution

  • Looks like you could use a groupby.rank:

    df['col_x'] = df.groupby('id')['my_date'].rank('dense').astype(int)
    

    Which is more or less equivalent to finding the sorting order (e.g. using numpy.argsort, or rank('first')), except it will rank the duplicate values equally:

    df['col_x'] = df.groupby('id')['my_date'].transform(np.argsort).add(1)
    

    Output:

        id    my_date  col1  col_x
    0  111 2023-05-15     1      3
    1  111 2023-05-14    11      2
    2  111 2023-05-13     2      1
    3  222 2023-10-11     3      1
    4  222 2023-10-12    55      2
    

    Example with duplicate values:

        id    my_date  col1  rank('dense')  argsort+1  rank('first')
    0  111 2023-05-15     1              3          4              4
    1  111 2023-05-14    11              2          2              2
    2  111 2023-05-14     9              2          3              3
    3  111 2023-05-13     2              1          1              1
    4  222 2023-10-11     3              1          1              1
    5  222 2023-10-12    55              2          2              2