pandas dataframe join merge concatenation

Concat two DataFrames without any duplicate values


I want to join two DataFrames (df_1 and df_2) on date, so that a new DataFrame df_1_2 is formed with the columns Date, A, B, C, in which every date appears with its respective values and without any duplicates.

Current code:

import pandas as pd

# create dictionary

dict1 = {"Date":["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"], "A":[99, 93,100,97], "B": [106,107,109,105]}
dict2 = {"Date":["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"], "A":[99, 96,100,97], "B": [106,100,109,105], "C":[2,5,8,4]}

# create dataframe using dict1

df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1.set_index("Date", inplace = True)

# create dataframe using dict2

df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2.set_index("Date", inplace = True)

# concat df_1 & df_2

df_1_2 = pd.concat([df_1, df_2])
print(df_1_2)

Expected output:

[Image: expected output table]


Solution

  • Emma's solution of using combine_first is the most straightforward and most likely the best:

    df_1.combine_first(df_2)
    

    You can also look into merge. Is there a reason you have to set Date as the index in the first place? Either way, if you reset the indexes, outer-merge the two frames on Date, A, and B, and then set the index back to Date, you get the same result as combine_first for this data, because rows that share a date also agree on A and B. It is just an alternative for more options:

    (df_1.reset_index().merge(df_2.reset_index(), on = ['Date', 'A', 'B'], how = 'outer')
     .set_index('Date')
     .sort_index())
    

    Output for both:

                  A    B    C
    Date                     
    2000-01-01   99  106  2.0
    2000-01-03   96  100  5.0
    2000-01-04   93  107  NaN
    2000-01-05  100  109  8.0
    2000-01-07   97  105  4.0
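
For anyone who wants to check this end to end, here is a minimal self-contained sketch that rebuilds the two frames from the question, applies both approaches, and compares the results. The second half is an assumption added purely for illustration: `df_2_conflict` and the value 98 are hypothetical and not part of the original question. They show how the two approaches may diverge when the frames disagree on A or B for the same date.

```python
import pandas as pd

# Rebuild the two frames from the question, indexed by Date.
df_1 = pd.DataFrame(
    {"Date": ["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"],
     "A": [99, 93, 100, 97],
     "B": [106, 107, 109, 105]}
)
df_2 = pd.DataFrame(
    {"Date": ["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"],
     "A": [99, 96, 100, 97],
     "B": [106, 100, 109, 105],
     "C": [2, 5, 8, 4]}
)
for df in (df_1, df_2):
    df["Date"] = pd.to_datetime(df["Date"])
df_1 = df_1.set_index("Date")
df_2 = df_2.set_index("Date")

# Approach 1: values from df_1 win, gaps are filled from df_2.
combined = df_1.combine_first(df_2)

# Approach 2: outer merge on every shared column, then restore the index.
merged = (
    df_1.reset_index()
    .merge(df_2.reset_index(), on=["Date", "A", "B"], how="outer")
    .set_index("Date")
    .sort_index()
)

# Compare values only; integer vs. float dtypes can differ between pandas versions.
same = combined.astype(float).equals(merged.astype(float))
print(same)

# Hypothetical conflict (not in the question): df_2 reports a different A for 2000-01-01.
df_2_conflict = df_2.copy()
df_2_conflict.loc[pd.Timestamp("2000-01-01"), "A"] = 98

conflict_combined = df_1.combine_first(df_2_conflict)
conflict_merged = (
    df_1.reset_index()
    .merge(df_2_conflict.reset_index(), on=["Date", "A", "B"], how="outer")
    .set_index("Date")
    .sort_index()
)

print(conflict_combined.index.is_unique)  # combine_first keeps one row per date
print(conflict_merged.index.is_unique)    # the merge keeps both conflicting rows
```

With the question's data the two results match. In the hypothetical conflicting case, combine_first keeps df_1's value for the shared date, while the merge on Date, A, and B keeps both rows and therefore repeats that date. Which behaviour is preferable depends on whether silent precedence or visible duplicates suits your data better.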