Search code examples
pythonpandasdataframegroup-byconditional-statements

Pandas - Conditional grouping


I have the following DataFrame containing employment history: (1900-01-01 is used to store a null end date)

id company job_title start_date end_date is_current_employer
A Google Manager 2023-05-06 1900-01-01 1
A Amazon SWE 2021-01-19 2023-02-02 0
A Meta SWE 2017-11-13 2020-10-24 0
B Tesla Research 2020-06-06 2023-07-14 0
B Microsoft Data Eng. 2017-04-17 2019-12-05 0
C Adobe Intern 2022-10-12 1900-01-01 1
C TikTok Intern 2023-03-23 1900-01-01 1

I'm hoping to return a DataFrame with one record for each ID, with that ID's most recent employment, like so:

id company job_title start_date end_date is_current_employer
A Google Manager 2023-05-06 1900-01-01 1
B Tesla Research 2020-06-06 2023-07-14 0
C Adobe Intern 2022-10-12 1900-01-01 1

If a candidate (ID) has only one current employer, that record should be returned. If a candidate has no current employers, the most recent employer should be returned (max start date). If a candidate has multiple current employers, the record with the earlier start date should be returned.

How can I concisely do this in Python?


Solution

  • You can use group by and first() to do this. First, you must sort by start date to get the most recent employment:

    df.sort_values(["start_date"], ascending=False).groupby("id").first()
    

    Output:

    id company job_title start_date end_date is_current_employer
    A Google Manager 2023-05-06 1900-01-01 1
    B Tesla Research 2020-06-06 2023-07-14 0
    C TikTok Intern 2023-03-23 1900-01-01 1