I want to join two dataframe (i.e df_1, df_2) based on date so that a new df_1_2 formed with column as Date, A, B, C where all dates with their resppective values are present without any duplicate value
Current code:
import pandas as pd
# create dictionary
dict1 = {"Date":["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"], "A":[99, 93,100,97], "B": [106,107,109,105]}
dict2 = {"Date":["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"], "A":[99, 96,100,97], "B": [106,100,109,105], "C":[2,5,8,4]}
# create dataframe using dict1
df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1.set_index("Date", inplace = True)
# create dataframe using dict2
df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2.set_index("Date", inplace = True)
# concat df_1 & df_2
df_1_2 = pd.concat([df_1, df_2])
print(df_1_2)
Expected output:
Emma's solution of using combine_first
is the most straight forward and most likely the best:
df_1.combine_first(df_2)
You can also look into merge
. Is there a reason you have to set Date as the index in the first portion? Regardless, if you reset the indexes, merge them on Date
, A
, and B
via outer
join and then set the index back to Date
, you can get the same result as combine_first
. Just an alternative for more options:
(df_1.reset_index().merge(df_2.reset_index(), on = ['Date', 'A', 'B'], how = 'outer')
.set_index('Date')
.sort_index())
Output for both:
A B C
Date
2000-01-01 99 106 2.0
2000-01-03 96 100 5.0
2000-01-04 93 107 NaN
2000-01-05 100 109 8.0
2000-01-07 97 105 4.0