Search code examples
pythonpandasdataframemergeconcatenation

How to get unique dataframe rows based on multiple keys


I have a dataframe called 'df1':

Name Type Destination Data1 Data2
Bob   Car         NY   asdf  dsfg
Liz   Car         NY   asdf  dsfg

And another called 'df2':

Name Type Destination Data1 Data2
Bob  Train         LA   asdf  dsfg
Liz   Car          NY   asdf  dsfg

I want to combine them together based on 3 values to form a 'key': Name, Type, Destination to end up with:

Name Type Destination
Bob   Car         NY
Bob Train         LA
Liz   Car         NY

Where there are no repeats and the row with Bob, Train, LA... is added since it is a unique entry.

So far I have:

new_df = pd.concat([df1.Name, df2.Name]).drop_duplicates().sort_values(ascending=True).reset_index(drop=True)

but that only works when trying to combine a unique list based off a singular key.


Solution

    • Use pandas.DataFrame.duplicated, which return Boolean Series denoting duplicate rows.
      • The entire row is used to determine duplicates.
      • Returns True for all the duplicated rows, so to keep the non-duplicate rows, use ~, which is (NOT).
    • If only specific columns are wanted in the output, they must be specified with .iloc or df1[['Name', 'Type', 'Destination']] or they can be dropped after using concat.
    import pandas as pd
    
    data1 = {'Name': ['Bob', 'Liz'], 'Type': ['Car', 'Car'], 'Destination': ['NY', 'NY'], 'Data1': ['asdf', 'asdf'], 'Data2': ['dsfg', 'dsfg']}
    data2 = {'Name': ['Bob', 'Liz'], 'Type': ['Train', 'Car'], 'Destination': ['LA', 'NY'], 'Data1': ['asdf', 'asdf'], 'Data2': ['dsfg', 'dsfg']}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    
    # concat all the desired rows
    dfc = pd.concat([df1.iloc[:, :3], df2.iloc[:, :3]])
    
    # drop the duplicated row
    dfc = dfc[~dfc.duplicated()]
    
    # display(dfc)
    
      Name   Type Destination
    0  Bob    Car          NY
    1  Liz    Car          NY
    0  Bob  Train          LA
    

    Option 2

    dfc = pd.concat([df1, df2]).drop_duplicates(subset=['Name', 'Type', 'Destination'])
    
    # display(dfc)
    
      Name   Type Destination Data1 Data2
    0  Bob    Car          NY  asdf  dsfg
    1  Liz    Car          NY  asdf  dsfg
    0  Bob  Train          LA  asdf  dsfg