Search code examples
pythonpandasmergesql-updateinsert-update

Python Pandas dataframes merge update


My problem is kind of a bit tricky (similar to sql merge/update), and not understanding how to fix: ( I am giving a small sample of the dataframes below)

I have two dataframes :

  dfOld 

     A   B  C  D  E
     x1  x2  g  h  r
     q1  q2  x  y  s
     t1  t2  h  j  u
     p1  p2  r  s  t

AND

 dfNew 

          A   B  C  D   E
          x1  x2  a  b  c
          s1  s2  p  q  r
          t1  t2  h  j  u
          q1  q2  x  y  z

We want to merge the dataframes with the following rule : ( we can think Col A & ColB as keys)

For any ColA & ColB combination if C/D/E are exact match then it takes value from any dataframe, however if any value has changed in Col C/D/E , it takes the value from new dataframe and if a new ColA/Col B combination is in DfNew then it takes those values and if the ColA/ColB combination does not exist in dfNew then it takes the value from dfOld:

So my OutPut should be like:

            A   B  C  D   E
            x1  x2  a  b  c
            q1  q2  x  y  z
            t1  t2  h  j  u
            p1  p2  r  s  t
            s1  s2  p  q  r

I was trying :

    mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
    mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])


    dfO = pd.concat([mydfL,mydfR])

    dfO.drop("_merge", axis=1, inplace=True)

My output looks like: ( I kept the index for clarity)

            A   B  C  D  E
        0  x1  x2  a  b  c
        2  s1  s2  p  q  r
        3  q1  q2  x  y  z
        0  x1  x2  g  h  r
        2  q1  q2  x  y  s
        3  p1  p2  r  s  t

However, this output does not serve my purpose. First and foremost it does not include the totally identical row (between dfOld & dfnew) which consists of :

          t1  t2  h  j  u

and next it includes all the rows where for the ColA/Col x, y and q1, q2, where I just wanted the updated values in ColC/D/E in the new data frame ( dfNew). It includes data from both.

So can I get some help as to what am I missing and what may be a better and elegant way to do this. Thanks in advance.


Solution

  • You can use combine_first using A/B as temporary index:

    out = (dfNew.set_index(['A', 'B'])
                .combine_first(dfOld.set_index(['A', 'B']))
                .reset_index()
          )