Search code examples
pythonpandasmergefillna

Replace NaN in dataframe during merging / left join


I am merging two dataframes together as a left join. However, if the value in a specific column is either blank or NaN, I'd like to replace that value from the "right" dataframe (and only in that case. Otherwise, I want to ignore the 'Cost' data in df2)

df1 = pd.DataFrame({
         'ID':[1,2,3,4,5,6],
         'Version':[1,1,2,2,1,2],
         'Cost':[17,np.nan,24,21,'',8]})

df2 = pd.DataFrame({
         'ID':[1,2,3,4,5,6,7,8,9],
         'Color':["Red","Orange","Green","Blue","Indigo", "Violet","Black","White","Gold"],
         'UnUsedData': ['foo','bar','foo','bar','foo','bar','foo','bar','foo'],
         'Cost':[17,34,54,28,22,8,43,23,12]})

The merge statement is:

df_new = pd.merge(df1, df2[['ID','Color']], on ='ID', how ='left')

Which yields:

   ID  Version Cost   Color
0   1        1   17     Red
1   2        1   NaN  Orange
2   3        2   24   Green
3   4        2   21    Blue
4   5        1       Indigo
5   6        2    8  Violet

But what I want the output to look like is: [the cost column values change in index rows #s 1 and 4]

   ID  Version Cost   Color
0   1        1   17   Red
1   2        1   34   Orange
2   3        2   24   Green
3   4        2   21   Blue
4   5        1   22   Indigo
5   6        2    8   Violet

I could loop through the individual values of the cost column of df_new, and then lookup the value in df2 for each one that is blank or NaN, but it seems like there would be a more elegant/simpler approach. Maybe somehow using fillna()? The examples of that I've seen seem to be replacing NaN with a constant value, rather than something that varies according to the item.


Solution

  • You can use combine_first to get the first non-na information:

    # merge
    dfx = pd.merge(df1, df2[['ID','Color','Cost']], on ='ID', how ='left')
    
    # replace empty space with NAN
    dfx = dfx.replace("", np.nan)
    
    # coalesce cost column to get first non NA value
    dfx['Cost'] = dfx['Cost_x'].combine_first(dfx['Cost_y']).astype(int)
    
    # remove the cols
    dfx = dfx.drop(['Cost_x', 'Cost_y'], 1)
    print(dfx)
    
       ID  Version   Color  Cost
    0   1        1     Red    17
    1   2        1  Orange    34
    2   3        2   Green    24
    3   4        2    Blue    21
    4   5        1  Indigo    22
    5   6        2  Violet     8