Search code examples
pythonpandascomparereconciliation

Find mismatch in all columns between two rows linked by another column in dataframe


I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?

import pandas as pd

df = pd.DataFrame({'col_1':       ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                    'identifier': [  1,   2,   3,   2,   3,   4,   4],
                    'col_3':      [ 10,  20,   30,  21, 31,  40,  41],
                    'col_4':      [  1,   1,    1,   1,  1,   1,   1]
                    })

In above df, it would be

  • col_1 for identifier 4 (D vs. E)
  • col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)

Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.


Solution

  • IIUC, you can agregate the columns as sets and keep those with more than one element:

    s = df.groupby('identifier').agg(set).stack()
    out = s[s.str.len().gt(1)]
    

    output:

    identifier       
    2           col_3    {20, 21}
    3           col_3    {30, 31}
    4           col_1      {D, E}
                col_3    {40, 41}
    dtype: object
    

    further aggregation:

    out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)
    

    output:

    identifier
    2           [col_3]
    3           [col_3]
    4    [col_1, col_3]
    Name: level_1, dtype: object