As the title suggests, I want to combine two Pandas datasets where both has duplicate keys but also considering a certain condition on other columns (in this case the date columns). I have these two datasets:
Dataset A:
ID | Fact | Level | Fact date |
---|---|---|---|
1 | Alcohol abuse | Specialist | 2024-01-23 |
1 | Anxiety disorder | Specialist | 2024-01-23 |
1 | Depression | Advisor | 2024-02-15 |
Dataset B:
ID | User | Action date |
---|---|---|
1 | AA | 2024-01-23 |
1 | BB | 2024-02-17 |
That I want to combine as such:
Dataset AB:
ID | Fact | Level | Fact date | User | Action date |
---|---|---|---|---|---|
1 | Alcohol abuse | Specialist | 2024-01-23 | AA | 2024-01-23 |
1 | Anxiety disorder | Specialist | 2024-01-23 | AA | 2024-01-23 |
1 | Depression | Advisor | 2024-02-15 | BB | 2024-02-17 |
I want to merge rows where the fact date is not later than the action date. But as you see on the third row of the merged database, it will take the data with the closest possible fact date (so BB instead of AA, even though 23 january is earlier than 17 february).
Should the fact date of Depression be 18th february (later than 17th february), then by the left merge it will return an empty information for the third table.
Dataset AB (if Depression has fact date on 18th february):
ID | Fact | Level | Fact date | User | Action date |
---|---|---|---|---|---|
1 | Alcohol abuse | Specialist | 2024-01-23 | AA | 2024-01-23 |
1 | Anxiety disorder | Specialist | 2024-01-23 | AA | 2024-01-23 |
1 | Depression | Advisor | 2024-02-18 |
Any help appreciated!
how='left'
;User
and Action date
where Action Date >= Fact Date
;Action date
.df = pd.merge(df1, df2, on=["ID"], how="left")
df[["Action date", "User"]] = df.loc[
df["Action date"] >= df["Fact date"], ["Action date", "User"]
]
df = df.sort_values("Action date").drop_duplicates(["ID", "Fact", "Level", "Fact date"])
For Fact == Depression
and Fact Date == 2024-02-15
:
ID Fact Level Fact date User Action date
0 1 Alcohol abuse Specialist 2024-01-23 AA 2024-01-23
2 1 Anxiety disorder Specialist 2024-01-23 AA 2024-01-23
5 1 Depression Advisor 2024-02-15 BB 2024-02-17
For Fact == Depression
and Fact Date == 2024-02-18
:
ID Fact Level Fact date User Action date
0 1 Alcohol abuse Specialist 2024-01-23 AA 2024-01-23
2 1 Anxiety disorder Specialist 2024-01-23 AA 2024-01-23
4 1 Depression Advisor 2024-02-18 NaN NaT