My problem is kind of a bit tricky (similar to sql merge/update), and not understanding how to fix: ( I am giving a small sample of the dataframes below)
I have two dataframes :
dfOld
A B C D E
x1 x2 g h r
q1 q2 x y s
t1 t2 h j u
p1 p2 r s t
AND
dfNew
A B C D E
x1 x2 a b c
s1 s2 p q r
t1 t2 h j u
q1 q2 x y z
We want to merge the dataframes with the following rule : ( we can think Col A & ColB as keys)
For any ColA & ColB combination if C/D/E are exact match then it takes value from any dataframe, however if any value has changed in Col C/D/E , it takes the value from new dataframe and if a new ColA/Col B combination is in DfNew then it takes those values and if the ColA/ColB combination does not exist in dfNew then it takes the value from dfOld:
So my OutPut should be like:
A B C D E
x1 x2 a b c
q1 q2 x y z
t1 t2 h j u
p1 p2 r s t
s1 s2 p q r
I was trying :
mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
dfO = pd.concat([mydfL,mydfR])
dfO.drop("_merge", axis=1, inplace=True)
My output looks like: ( I kept the index for clarity)
A B C D E
0 x1 x2 a b c
2 s1 s2 p q r
3 q1 q2 x y z
0 x1 x2 g h r
2 q1 q2 x y s
3 p1 p2 r s t
However, this output does not serve my purpose. First and foremost it does not include the totally identical row (between dfOld & dfnew) which consists of :
t1 t2 h j u
and next it includes all the rows where for the ColA/Col x, y and q1, q2, where I just wanted the updated values in ColC/D/E in the new data frame ( dfNew). It includes data from both.
So can I get some help as to what am I missing and what may be a better and elegant way to do this. Thanks in advance.
You can use combine_first
using A/B as temporary index:
out = (dfNew.set_index(['A', 'B'])
.combine_first(dfOld.set_index(['A', 'B']))
.reset_index()
)