I am currently working with python to merge two dataframes that look like below:
# Primary
df1 = [['A','2021-03','NA',9,'NA'], ['B','2021-09','NA','NA',27], ['C','2021-12','NA',12,28]]
df1_fin = pd.DataFrame(df1, columns=['ID','Date','Value_1','Value_2','Value_3'])
# Secondatry
df2 = [['A','2021-03',80,20,30], ['B','2021-09',90,'NA',20], ['B','2021-12','NA','NA',27], ['D','2020-06',4,12,28]]
df2_fin = pd.DataFrame(df2, columns=['ID','Date','Value_1','Value_2','Value_3'])
I want to perform outer join but keep the value of first dataframe if it already exist. The key columns will be ID and Date.
The result dataframe will look like below:
ID | Date | Value_1 | Value_2 | Value_3 |
---|---|---|---|---|
A | 2021-03 | 80 | 9 | 30 |
B | 2021-09 | 90 | NA | 27 |
B | 2021-12 | NA | NA | 27 |
C | 2021-12 | NA | 12 | 28 |
D | 2020-06 | 4 | 12 | 28 |
Should I fill in NA first and then combine the rest rows? or is there a function that I can define the parameters to perform both actions?
Yes, there's a function for it in pandas combine_first
:
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
df1_fin.set_index(['ID', 'Date']).combine_first(df2_fin.set_index(['ID', 'Date'])).reset_index()
(Please note that in your example, you provide two dataframes without any NaN
values but with the string 'NaN'
instead, which has no special meaning. Replace 'NaN'
with None
in the example to get the intended meaning.)