I have a dataframe called 'df1':
Name Type Destination Data1 Data2
Bob Car NY asdf dsfg
Liz Car NY asdf dsfg
And another called 'df2':
Name Type Destination Data1 Data2
Bob Train LA asdf dsfg
Liz Car NY asdf dsfg
I want to combine them together based on 3 values to form a 'key': Name, Type, Destination to end up with:
Name Type Destination
Bob Car NY
Bob Train LA
Liz Car NY
Where there are no repeats and the row with Bob, Train, LA... is added since it is a unique entry.
So far I have:
new_df = pd.concat([df1.Name, df2.Name]).drop_duplicates().sort_values(ascending=True).reset_index(drop=True)
but that only works when trying to combine a unique list based off a singular key.
pandas.DataFrame.duplicated
, which return Boolean Series denoting duplicate rows.
True
for all the duplicated rows, so to keep the non-duplicate rows, use ~
, which is (NOT
)..iloc
or df1[['Name', 'Type', 'Destination']]
or they can be dropped after using concat
.import pandas as pd
data1 = {'Name': ['Bob', 'Liz'], 'Type': ['Car', 'Car'], 'Destination': ['NY', 'NY'], 'Data1': ['asdf', 'asdf'], 'Data2': ['dsfg', 'dsfg']}
data2 = {'Name': ['Bob', 'Liz'], 'Type': ['Train', 'Car'], 'Destination': ['LA', 'NY'], 'Data1': ['asdf', 'asdf'], 'Data2': ['dsfg', 'dsfg']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# concat all the desired rows
dfc = pd.concat([df1.iloc[:, :3], df2.iloc[:, :3]])
# drop the duplicated row
dfc = dfc[~dfc.duplicated()]
# display(dfc)
Name Type Destination
0 Bob Car NY
1 Liz Car NY
0 Bob Train LA
subset
parameter of pandas.DataFrame.drop_duplicates
allows for specifying which columns to use when checking for duplicates.dfc = pd.concat([df1, df2]).drop_duplicates(subset=['Name', 'Type', 'Destination'])
# display(dfc)
Name Type Destination Data1 Data2
0 Bob Car NY asdf dsfg
1 Liz Car NY asdf dsfg
0 Bob Train LA asdf dsfg