Search code examples
pandasjoinmergeconcatenation

Merge Pandas Dataframe: how to add columns and replace values


I have a dataframe df1 and want to merge other (many) dataframes df2 so that:

  • Merge happens on the matching (multi-) indexes
  • New columns are created if missing
  • Values are replaced if the column already exists

enter image description here

What is the correct pandas operation to use and with what arguments? I looked into concat/join/merge/assign/append but did not find it yet.

Code for dataframes:

df1 = pd.DataFrame({'A':['A1', 'A2', 'A3', 'A4'],
                    'B':['B1', 'B2' ,'B3', 'B4'],
                    'C':['C1' ,'C2', 'C3', 'C4']},
                  index = [1,2,3,4])

df2 = pd.DataFrame({'C':['NewC'], 'D':['NewD']},
                  index=[3])

Solution

  • One way is to use combine_first:

    df2.combine_first(df1)
    

    Output:

        A   B     C     D
    1  A1  B1    C1   NaN
    2  A2  B2    C2   NaN
    3  A3  B3  NewC  NewD
    4  A4  B4    C4   NaN
    

    Another way is to use join with fillna:

    df1[['A','B']].join(df2).fillna(df1)
    

    Output:

        A   B     C     D
    1  A1  B1    C1   NaN
    2  A2  B2    C2   NaN
    3  A3  B3  NewC  NewD
    4  A4  B4    C4   NaN
    

    A third way,

    df1a = df1.reindex(df1.columns.union(df2.columns), axis=1)
    df1a.update(df2)
    df1a
    

    Timings:

    %%timeit pd.concat((df1,df2),sort=False).groupby(level=0).last()  
    

    4.56 ms ± 947 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit 
    df1a = df1.reindex(df1.columns.union(df2.columns), axis=1)  
    df1a.update(df2)
    df1a
    

    2.93 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit df1[['A','B']].join(df2).fillna(df1)
    

    5.2 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit df2.combine_first(df1)
    

    5.37 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)