I have Data Frame in Python Pandas like below:
Column my_date is in datetime format.
In my real DataFrame I have many more columns.
id | my_date | col1 |
---|---|---|
111 | 2023-05-15 | 1 |
111 | 2023-05-14 | 11 |
111 | 2023-05-13 | 2 |
222 | 2023-10-11 | 3 |
222 | 2023-10-12 | 55 |
And I need to create new column col_x where for each id will be value 1,2,3 and so on based on date in column my_date. So, for each id in column col_x has to be 1 in the early date from my_date and so on.
So, as a result I need something like below:
id | my_date | col1 | col_x |
---|---|---|---|
111 | 2023-05-15 | 1 | 3 |
111 | 2023-05-14 | 11 | 2 |
111 | 2023-05-13 | 2 | 1 |
222 | 2023-10-11 | 3 | 1 |
222 | 2023-12-12 | 55 | 2 |
How can I do that in Python Pandas ?
Looks like you could use a groupby.rank
:
df['col_x'] = df.groupby('id')['my_date'].rank('dense').astype(int)
Which is more or less equivalent to finding the sorting order (e.g. using numpy.argsort
, or rank('first')
), except it will rank the duplicate values equally:
df['col_x'] = df.groupby('id')['my_date'].transform(np.argsort).add(1)
Output:
id my_date col1 col_x
0 111 2023-05-15 1 3
1 111 2023-05-14 11 2
2 111 2023-05-13 2 1
3 222 2023-10-11 3 1
4 222 2023-10-12 55 2
Example with duplicate values:
id my_date col1 rank('dense') argsort+1 rank('first')
0 111 2023-05-15 1 3 4 4
1 111 2023-05-14 11 2 2 2
2 111 2023-05-14 9 2 3 3
3 111 2023-05-13 2 1 1 1
4 222 2023-10-11 3 1 1 1
5 222 2023-10-12 55 2 2 2