Search code examples
pythonpandasdatemergeconditional-statements

Merge pandas dataframes on duplicate keys with certain conditions on other columns


As the title suggests, I want to combine two Pandas datasets where both has duplicate keys but also considering a certain condition on other columns (in this case the date columns). I have these two datasets:

Dataset A:

ID Fact Level Fact date
1 Alcohol abuse Specialist 2024-01-23
1 Anxiety disorder Specialist 2024-01-23
1 Depression Advisor 2024-02-15

Dataset B:

ID User Action date
1 AA 2024-01-23
1 BB 2024-02-17

That I want to combine as such:

Dataset AB:

ID Fact Level Fact date User Action date
1 Alcohol abuse Specialist 2024-01-23 AA 2024-01-23
1 Anxiety disorder Specialist 2024-01-23 AA 2024-01-23
1 Depression Advisor 2024-02-15 BB 2024-02-17

I want to merge rows where the fact date is not later than the action date. But as you see on the third row of the merged database, it will take the data with the closest possible fact date (so BB instead of AA, even though 23 january is earlier than 17 february).

Should the fact date of Depression be 18th february (later than 17th february), then by the left merge it will return an empty information for the third table.

Dataset AB (if Depression has fact date on 18th february):

ID Fact Level Fact date User Action date
1 Alcohol abuse Specialist 2024-01-23 AA 2024-01-23
1 Anxiety disorder Specialist 2024-01-23 AA 2024-01-23
1 Depression Advisor 2024-02-18

Any help appreciated!


Solution

    1. Merge the dataframes with how='left';
    2. Keep only values for User and Action date where Action Date >= Fact Date;
    3. Drop duplicates after sorting the rows to keep the latest date for Action date.
    df = pd.merge(df1, df2, on=["ID"], how="left")
    df[["Action date", "User"]] = df.loc[
        df["Action date"] >= df["Fact date"], ["Action date", "User"]
    ]
    df = df.sort_values("Action date").drop_duplicates(["ID", "Fact", "Level", "Fact date"])
    

    For Fact == Depression and Fact Date == 2024-02-15:

       ID              Fact       Level  Fact date User Action date
    0   1     Alcohol abuse  Specialist 2024-01-23   AA  2024-01-23
    2   1  Anxiety disorder  Specialist 2024-01-23   AA  2024-01-23
    5   1        Depression     Advisor 2024-02-15   BB  2024-02-17
    

    For Fact == Depression and Fact Date == 2024-02-18:

       ID              Fact       Level  Fact date User Action date
    0   1     Alcohol abuse  Specialist 2024-01-23   AA  2024-01-23
    2   1  Anxiety disorder  Specialist 2024-01-23   AA  2024-01-23
    4   1        Depression     Advisor 2024-02-18  NaN         NaT