Search code examples
pythonpandasdataframeconcatenation

pandas concat compare 2 dataframes with same column names how to build a differ column?


I'm using pandas concat to compare 2 dataframes have same columns and rows:

import pandas as pd

df=pd.read_csv(r'C:\Users\compare\T1.csv')
df2=pd.read_csv(r'C:\Users\compare\T2.csv')

index=['contract','RB','payee','fund']

df_all = pd.concat([df.set_index(index), df2.set_index(index)], 
                   axis='columns', keys=['First', 'Second'])

df_final = df_all.swaplevel(axis='columns')[df.columns[54:56]]

df_final

The output is:

                              SD1                SD2
                             First     Second    First   Second
contract    RB  payee   fund                
AG72916Z    2   1   W42      15622.9    15622.9  15622.9    15489.2
            4   1   W44      14697.8    14697.8  14697.8    14572.1
            8   1   W48      7388.56    7388.56  7388.56    7325.37
AL0024AZ    C3  1   202      226.585    226.59   220.366    220.37
            S3  1   204      804.059    804.06   781.99     781.99

My question is how can I add a differ column after each Second ,so that I can easily tell the comparison result,the output should looks like this:

                              SD1                         SD2
                             First     Second   differ    First     Second   differ
contract    RB  payee   fund                
AG72916Z    2   1   W42      15622.9    15622.9    0       15622.9  15489.2  133.7
            4   1   W44      14697.8    14697.8    0       14697.8  14572.1  125.7
            8   1   W48      7388.56    7388.56    0       7388.56  7325.37  63.19
AL0024AZ    C3  1   202      226.585    226.59     0.05    220.366  220.37  -0.004
            S3  1   204      804.059    804.06     0.01    781.99   781.99   0

Solution

  • A bit tricky but necessary to keep ordering:

    out = df_final.stack(level=0).assign(Diff=lambda x: x['First'] - x['Second']) \
                  .stack().unstack(level=[-2, -1])
    print(out)
    
    # Output
                                  SD1                         SD2                   
                                First    Second   Diff      First    Second     Diff
    contract RB payee fund                                                          
    AG72916Z 2  1     W42   15622.900  15622.90  0.000  15622.900  15489.20  133.700
             4  1     W44   14697.800  14697.80  0.000  14697.800  14572.10  125.700
             8  1     W48    7388.560   7388.56  0.000   7388.560   7325.37   63.190
    AL0024AZ C3 1     202     226.585    226.59 -0.005    220.366    220.37   -0.004
             S3 1     204     804.059    804.06 -0.001    781.990    781.99    0.000
    

    Update

    What if I want to only select the row that Diff large than 100

    Use:

    >>> out[out.loc[:, (slice(None), 'Diff')].gt(100).any(1)]
    
                                SD1                    SD2                
                              First   Second Diff    First   Second   Diff
    contract RB payee fund                                                
    AG72916Z 2  1     W42   15622.9  15622.9  0.0  15622.9  15489.2  133.7
             4  1     W44   14697.8  14697.8  0.0  14697.8  14572.1  125.7
    
    # Same result with
    # idx = pd.IndexSlice
    # out[out.loc[:, idx[:, 'Diff']].gt(100).any(1)]