I want to join 2 tables, with same column names, table 1 has data universe with some NA's, table 2 has valid data for which table 1 has NA values.
I want to populate NA value in table 1 based on table 2 using joins.
Edit: Table 2 doesn't contain all columns as Table 1
I tried this
import pandas as pd
data1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8,]], columns=['A', 'B', 'C'])
data2 = pd.DataFrame([[7, 8, 9]], columns=['A','B','C'])
data1.merge(data2, on=['A','B'], how='left')
Output I expect:
Output I got:
If you're always "joining" based on known columns, then you probably want to concat
and groupby.first
(or last
if you have different values and want to give priority to the new one):
out = pd.concat([data1, data2]).groupby(['A', 'B'], as_index=False).first()
Alternatively:
cols = ['A', 'B']
out = (data1.set_index(cols)
.combine_first(data2.set_index(cols))
.reset_index()
)
Or, assuming you have range index in data1
and no duplicates for A/B in data2
(if not this will produce incorrect results), using a merge
:
out = data1.combine_first(data1[cols].merge(data2, how='left'))
Output:
A B C
0 1 2 3.0
1 4 5 6.0
2 7 8 9.0