I have 2 dataframes (say d1
and d2
) both with columns c1
and c2
. I'd like to concatenate them. However, for all values that occur in column c1
in both tables I'd like to retain only the row from d1
.
Example:
df1 = pd.DataFrame({"Customer":["Alice", "Bob", "John"],
"Status":["closed","in-progress","closed"]})
df2 = pd.DataFrame({"Customer":["Alice", "Lara", "Santa"],
"Status":["in-progress","in-progress","closed"]})
desired_result = pd.DataFrame({"Customer":["Alice", "Bob", "John", "Lara", "Santa"],
"Status":["closed","in-progress","closed","in-progress","closed"]})
d1:
Customer Status
0 Alice closed
1 Bob in-progress
2 John closed
d2:
Customer Status
0 Alice in-progress
1 Lara in-progress
2 Santa closed
desired_result:
Customer Status
0 Alice closed
1 Bob in-progress
2 John closed
3 Lara in-progress
4 Santa closed
Notice Customer Alice. She occurs in both d1.Customer and d2.Customer, so only the corresponding row from d1 needs to be retained. All other customers in d1 and d2 are unique so their corresponding rows end up in the final table. How can I accomplish this?
What you're asking to do is to concatenate the dataframes, and then drop the duplicates, keeping the first item for each duplicate. That is what the code below does:
pd.concat([df1, df2], axis=0, ignore_index=True).drop_duplicates(subset=['Customer'], keep='first')
A couple of notes:
keep='first'
is unnecessary (just there to be illustrative)You can read the documentation for these functions here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
https://pandas.pydata.org/docs/reference/api/pandas.concat.html