I have two datasets like this
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': [1, 2,3,4,5], 'first': [np.nan,np.nan,1,0,np.nan], 'second': [1,np.nan,np.nan,np.nan,0]})
df2 = pd.DataFrame({'id': [1, 2,3,4,5, 6], 'first': [np.nan,1,np.nan,np.nan,0, 1], 'third': [1,0,np.nan,1,1, 0]})
And I want to get
result = pd.merge(df1, df2, left_index=True, right_index=True,on='id', how= 'outer')
result['first']= result[["first_x", "first_y"]].sum(axis=1)
result.loc[(result['first_x'].isnull()) & (result['first_y'].isnull()), 'first'] = np.nan
result.drop(['first_x','first_y'] , 1)
id second third first
0 1 1.0 1.0 NaN
1 2 NaN 0.0 1.0
2 3 NaN NaN 1.0
3 4 NaN 1.0 0.0
4 5 0.0 1.0 0.0
5 6 NaN 0.0 1.0
The problem is that the real dataset includes about 200 variables and my way is very long. How to make it easier? Thanks
You should be able to use combine_first
:
>>> df1.set_index('id').combine_first(df2.set_index('id'))
first second third
id
1 NaN 1 1
2 1 NaN 0
3 1 NaN NaN
4 0 NaN 1
5 0 0 1
6 1 NaN 0