Search code examples
pythonpandasmergerowfillna

Fill in NA with other dataframe and then add the rows that are not in the first dataframe


I am currently working with python to merge two dataframes that look like below:

# Primary
df1 = [['A','2021-03','NA',9,'NA'], ['B','2021-09','NA','NA',27], ['C','2021-12','NA',12,28]]
df1_fin = pd.DataFrame(df1, columns=['ID','Date','Value_1','Value_2','Value_3'])

# Secondatry
df2 = [['A','2021-03',80,20,30], ['B','2021-09',90,'NA',20], ['B','2021-12','NA','NA',27], ['D','2020-06',4,12,28]]
df2_fin = pd.DataFrame(df2, columns=['ID','Date','Value_1','Value_2','Value_3'])

I want to perform outer join but keep the value of first dataframe if it already exist. The key columns will be ID and Date.

  1. If the ID and Date matches, the NA value will be replaced by second dataframe and existing values will not be replaced.
  2. If the ID and Date does not matches, new row will be created

The result dataframe will look like below:

ID Date Value_1 Value_2 Value_3
A 2021-03 80 9 30
B 2021-09 90 NA 27
B 2021-12 NA NA 27
C 2021-12 NA 12 28
D 2020-06 4 12 28

Should I fill in NA first and then combine the rest rows? or is there a function that I can define the parameters to perform both actions?


Solution

  • Yes, there's a function for it in pandas combine_first:

    Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.

    df1_fin.set_index(['ID', 'Date']).combine_first(df2_fin.set_index(['ID', 'Date'])).reset_index()
    

    (Please note that in your example, you provide two dataframes without any NaN values but with the string 'NaN' instead, which has no special meaning. Replace 'NaN' with None in the example to get the intended meaning.)