Tags: python-3.x, pandas, dataframe, outer-join, distinct-values

Get rows whose values differ between two DataFrames


Hello, how can I keep only the rows where the value is different in the two DataFrames? Note that a row can have id1, id2, or both, as below.

import numpy as np
import pandas as pd

d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)

Expected Output

d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}

F3 = pd.DataFrame(data=d2)


Solution

  • First merge on all columns using the left_on and right_on parameters, then filter out the rows that the indicator marks as both, and remove the missing values by reshaping with stack and unstack:

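    # F1 holds VAL1 and F2 holds VAL2, so pair them via left_on/right_on
    df=pd.merge(F1, F2, left_on=['id1','id2','VAL1'], 
                        right_on=['id1','id2','VAL2'], how="outer", indicator=True)
    
    # keep rows found in only one frame, then reshape to drop the NaNs
    df=(df[df['_merge'] !='both']
            .set_index(['id1','id2'])
            .drop(columns='_merge')
            .stack()
            .unstack()
            .reset_index())
    
    print (df)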
       id1 id2 VAL1 VAL2
    0  X02  Y4    3    1
    1  X22  Y1    1    3
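
To see what the merge step produces before the reshape, here is a minimal sketch that rebuilds the sample frames from the question and counts the rows for each value of the `_merge` indicator column. The names `merged`, `merge_counts` and `unmatched` are only illustrative and do not come from the answer above.

```python
import numpy as np
import pandas as pd

# Same sample frames as in the question.
F1 = pd.DataFrame({'id1': ['X22', 'X13', np.nan, 'X02', 'X14'],
                   'id2': ['Y1', 'Y2', 'Y3', 'Y4', np.nan],
                   'VAL1': [1, 0, 2, 3, 0]})
F2 = pd.DataFrame({'id1': ['X02', 'X13', np.nan, 'X22', 'X14'],
                   'id2': ['Y4', 'Y2', 'Y3', 'Y1', 'Y22'],
                   'VAL2': [1, 0, 4, 3, 1]})

# Outer merge on the ids and the value; indicator=True adds a '_merge'
# column that records where each row was found.
merged = pd.merge(F1, F2,
                  left_on=['id1', 'id2', 'VAL1'],
                  right_on=['id1', 'id2', 'VAL2'],
                  how='outer', indicator=True)

# Count the rows per indicator label (illustrative helper dict).
merge_counts = {label: int((merged['_merge'] == label).sum())
                for label in ['left_only', 'right_only', 'both']}
print(merge_counts)

# Rows that did not match on ids and value at the same time.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

Only rows whose ids and value agree in both frames are labelled `both`, so everything left in `unmatched` either has a different value or a different id pair. Whether a row survives the later `stack`/`unstack` reshape can depend on the pandas version, since the default handling of missing values in `stack` has changed between releases.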