I have the following DataFrame containing employment history: (1900-01-01 is used to store a null end date)
id | company | job_title | start_date | end_date | is_current_employer |
---|---|---|---|---|---|
A | Manager | 2023-05-06 | 1900-01-01 | 1 | |
A | Amazon | SWE | 2021-01-19 | 2023-02-02 | 0 |
A | Meta | SWE | 2017-11-13 | 2020-10-24 | 0 |
B | Tesla | Research | 2020-06-06 | 2023-07-14 | 0 |
B | Microsoft | Data Eng. | 2017-04-17 | 2019-12-05 | 0 |
C | Adobe | Intern | 2022-10-12 | 1900-01-01 | 1 |
C | TikTok | Intern | 2023-03-23 | 1900-01-01 | 1 |
I'm hoping to return a DataFrame with one record for each ID, with that ID's most recent employment, like so:
id | company | job_title | start_date | end_date | is_current_employer |
---|---|---|---|---|---|
A | Manager | 2023-05-06 | 1900-01-01 | 1 | |
B | Tesla | Research | 2020-06-06 | 2023-07-14 | 0 |
C | Adobe | Intern | 2022-10-12 | 1900-01-01 | 1 |
If a candidate (ID) has only one current employer, that record should be returned. If a candidate has no current employers, the most recent employer should be returned (max start date). If a candidate has multiple current employers, the record with the earlier start date should be returned.
How can I concisely do this in Python?
You can use group by and first() to do this. First, you must sort by start date to get the most recent employment:
df.sort_values(["start_date"], ascending=False).groupby("id").first()
Output:
id | company | job_title | start_date | end_date | is_current_employer |
---|---|---|---|---|---|
A | Manager | 2023-05-06 | 1900-01-01 | 1 | |
B | Tesla | Research | 2020-06-06 | 2023-07-14 | 0 |
C | TikTok | Intern | 2023-03-23 | 1900-01-01 | 1 |