I am merging two dataframes together as a left join. However, if the value in a specific column is either blank or NaN, I'd like to replace that value from the "right" dataframe (and only in that case. Otherwise, I want to ignore the 'Cost' data in df2)
df1 = pd.DataFrame({
'ID':[1,2,3,4,5,6],
'Version':[1,1,2,2,1,2],
'Cost':[17,np.nan,24,21,'',8]})
df2 = pd.DataFrame({
'ID':[1,2,3,4,5,6,7,8,9],
'Color':["Red","Orange","Green","Blue","Indigo", "Violet","Black","White","Gold"],
'UnUsedData': ['foo','bar','foo','bar','foo','bar','foo','bar','foo'],
'Cost':[17,34,54,28,22,8,43,23,12]})
The merge statement is:
df_new = pd.merge(df1, df2[['ID','Color']], on ='ID', how ='left')
Which yields:
ID Version Cost Color
0 1 1 17 Red
1 2 1 NaN Orange
2 3 2 24 Green
3 4 2 21 Blue
4 5 1 Indigo
5 6 2 8 Violet
But what I want the output to look like is: [the cost column values change in index rows #s 1 and 4]
ID Version Cost Color
0 1 1 17 Red
1 2 1 34 Orange
2 3 2 24 Green
3 4 2 21 Blue
4 5 1 22 Indigo
5 6 2 8 Violet
I could loop through the individual values of the cost column of df_new, and then lookup the value in df2 for each one that is blank or NaN, but it seems like there would be a more elegant/simpler approach. Maybe somehow using fillna()? The examples of that I've seen seem to be replacing NaN with a constant value, rather than something that varies according to the item.
You can use combine_first
to get the first non-na information:
# merge
dfx = pd.merge(df1, df2[['ID','Color','Cost']], on ='ID', how ='left')
# replace empty space with NAN
dfx = dfx.replace("", np.nan)
# coalesce cost column to get first non NA value
dfx['Cost'] = dfx['Cost_x'].combine_first(dfx['Cost_y']).astype(int)
# remove the cols
dfx = dfx.drop(['Cost_x', 'Cost_y'], 1)
print(dfx)
ID Version Color Cost
0 1 1 Red 17
1 2 1 Orange 34
2 3 2 Green 24
3 4 2 Blue 21
4 5 1 Indigo 22
5 6 2 Violet 8