I have the following dataframe:
ID Job Start Date
1 Driver 1951-01-01
1 Cleaner 2000-05-01
1 Staff 1951-01-01
2 Staff 2000-05-02
2 Staff2 2000-05-01
2 Cleaner 2000-04-01
5 Driver 1951-01-01
I need to create a column named "Primary?" that has the value "Yes" or "No" depending of the Start Date
For each ID, set the column to "Yes" for the earliest "Start Date" for the ID. If there is 2 rows tied with same "Start Date" pick one to set to "Yes" (can be the first one that appears)
All other rows set to "No" for the ID
In that case, each ID would have a row with a "Yes" (just 1 single row) and all other rows to "No" resulting in this dataframe:
ID Job Start Date Primary?
1 Driver 1951-01-01 Yes
1 Cleaner 2000-05-01 No
1 Staff 1951-01-01 No
2 Staff 2000-05-02 No
2 Staff2 2000-05-01 Yes
2 Cleaner 2000-04-01 Yes
5 Driver 1951-01-01 Yes
What is the best way to do it?
# set the primary start-date row as True/False, when its a min for the ID
df['Primary']= df['Start_Date'].eq(df.groupby(['ID'])['Start_Date'].transform(min))
# identify the duplicates start-dates
df.loc[df.duplicated(subset=['ID','Primary'], keep='first'),'Primary' ] = False
df
# Map True/False to Yes/No
df['Primary']=df['Primary'].map({True: 'Yes', False: 'No'})
df
+----+-----+----------+-------------+---------+
| | ID | Job | Start_Date | Primary |
+----+-----+----------+-------------+---------+
| 0 | 1 | Driver | 1951-01-01 | Yes |
| 1 | 1 | Cleaner | 2000-05-01 | No |
| 2 | 1 | Staff | 1951-01-01 | No |
| 3 | 2 | Staff | 2000-05-02 | No |
| 4 | 2 | Staff2 | 2000-05-01 | No |
| 5 | 2 | Cleaner | 2000-04-01 | Yes |
| 6 | 5 | Driver | 1951-01-01 | Yes |
+----+-----+----------+-------------+---------+