I have a dataframe with subjects and dates for a certain measurement. For each subject, I want to determine whether the date in each row of the group is the first (1), second (2), third (3), ... unique date value for that subject.
To clarify, this is what I am looking for:
|subject | date | order|
|---|---|---|
|A | 01.01.2020 | 1|
|A | 01.01.2020 | 1|
|A | 02.01.2020 | 2|
|B | 01.01.2020 | 1|
|B | 02.01.2020 | 2|
|B | 02.01.2020 | 2|
I thought about something like the following, but a for loop is not allowed inside the apply function:
df['order']=df.groupby(['subject']).apply(lambda x: i if x['date']=value for i, value in enumerate(x['date'].unique()))
Is there a straightforward way to do this?
Use factorize in GroupBy.transform:
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
subject date order order1
0 A 01.01.2020 1 1
1 A 01.01.2020 1 1
2 A 02.01.2020 2 2
3 B 01.01.2020 1 1
4 B 02.01.2020 2 2
5 B 02.01.2020 2 2
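To see why the `+ 1` is needed, it may help to look at what `pd.factorize` returns on its own. The sketch below uses a made-up Series (the name `dates` and its values are only for illustration): the codes are zero-based and follow the order of first appearance, not the sorted order.

```python
import pandas as pd

# Illustrative data, not taken from the question
dates = pd.Series(["02.01.2020", "01.01.2020", "02.01.2020", "03.01.2020"])

# factorize returns (codes, uniques); codes number the unique values
# in order of first appearance, starting at 0
codes, uniques = pd.factorize(dates)

print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['02.01.2020', '01.01.2020', '03.01.2020']
```

Inside `GroupBy.transform`, the same numbering restarts for each subject, and adding 1 shifts it to start at 1.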
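Alternatively, use GroupBy.rank, but the date column must first be converted to datetimes:
# convert the date strings to datetimes so they rank chronologically
df['date']=pd.to_datetime(df['date'])
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')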
print (df)
subject date order order2
0 A 2020-01-01 1 1.0
1 A 2020-01-01 1 1.0
2 A 2020-02-01 2 2.0
3 B 2020-01-01 1 1.0
4 B 2020-02-01 2 2.0
5 B 2020-02-01 2 2.0
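Since `rank` returns floats, a cast to integers may be wanted if the result should match the expected `order` column exactly. The following is a minimal, self-contained sketch of that variant. It assumes the strings are in day.month.year form and passes an explicit `format`, which is an assumption about the data; the output above was apparently parsed month-first, which does not change the resulting order for this sample.

```python
import pandas as pd

df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "date": ["01.01.2020", "01.01.2020", "02.01.2020",
             "01.01.2020", "02.01.2020", "02.01.2020"],
})

# Assumption: dates are day.month.year; adjust the format if they are not
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Dense rank per subject: equal dates share a rank and ranks have no gaps.
# The cast to int is safe here because there are no missing dates.
df["order"] = df.groupby("subject")["date"].rank(method="dense").astype(int)

print(df)
```

If the date column can contain missing values, the rank for those rows is NaN and the plain `astype(int)` cast would fail, so the cast would need to be dropped or replaced with a nullable integer type.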
The two solutions differ when the dates within a subject are not in chronological order:
print (df)
subject date order (disregarding temporal order of date)
0 A 2020-01-01 1
1 A 2020-03-01 2 <- changed datetime for sample
2 A 2020-02-01 3
3 B 2020-01-01 1
4 B 2020-02-01 2
5 B 2020-02-01 2
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order1 order2
0 A 2020-01-01 1 1 1.0
1 A 2020-03-01 1 2 3.0
2 A 2020-02-01 2 3 2.0
3 B 2020-01-01 1 1 1.0
4 B 2020-02-01 2 2 2.0
5 B 2020-02-01 2 2 2.0
In summary: use the first method if you don't care about the temporal order of date being reflected in the order output, or the second method if the temporal order matters and should be reflected in the order output.
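One way to keep that choice explicit is to wrap both methods behind a flag. The helper below is only a sketch: the function name `add_date_order` and the `chronological` parameter are hypothetical, and it assumes the columns are named `subject` and `date` and that `date` already holds datetimes without missing values.

```python
import pandas as pd

def add_date_order(df, chronological=False):
    """Hypothetical helper: number the unique dates within each subject."""
    grouped = df.groupby("subject")["date"]
    if chronological:
        # Second method: dense rank follows the temporal order of the dates
        order = grouped.rank(method="dense").astype(int)
    else:
        # First method: factorize follows the order of first appearance
        order = grouped.transform(lambda s: pd.factorize(s)[0] + 1).astype(int)
    return df.assign(order=order)

# Sample where subject A's dates are not in chronological order
df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-02-01",
                            "2020-01-01", "2020-02-01", "2020-02-01"]),
})

by_appearance = add_date_order(df)
by_time = add_date_order(df, chronological=True)

print(by_appearance["order"].tolist())  # [1, 2, 3, 1, 2, 2]
print(by_time["order"].tolist())        # [1, 3, 2, 1, 2, 2]
```

The two results only differ for subject A, whose rows are out of chronological order.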