Search code examples
pythonpandasdataframejoinmerge

Join 2 dataframes with same column but different values (Python)


I want to join 2 tables, with same column names, table 1 has data universe with some NA's, table 2 has valid data for which table 1 has NA values.

I want to populate NA value in table 1 based on table 2 using joins.

Edit: Table 2 doesn't contain all columns as Table 1

I tried this

import pandas as pd

data1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8,]], columns=['A', 'B', 'C'])
data2 = pd.DataFrame([[7, 8, 9]], columns=['A','B','C'])

data1.merge(data2, on=['A','B'], how='left')

Output I expect:

enter image description here

Output I got:

enter image description here


Solution

  • If you're always "joining" based on known columns, then you probably want to concat and groupby.first (or last if you have different values and want to give priority to the new one):

    out = pd.concat([data1, data2]).groupby(['A', 'B'], as_index=False).first()
    

    Alternatively:

    cols = ['A', 'B']
    
    out = (data1.set_index(cols)
           .combine_first(data2.set_index(cols))
           .reset_index()
          )
    

    Or, assuming you have range index in data1 and no duplicates for A/B in data2 (if not this will produce incorrect results), using a merge:

    out = data1.combine_first(data1[cols].merge(data2, how='left'))
    

    Output:

       A  B    C
    0  1  2  3.0
    1  4  5  6.0
    2  7  8  9.0