I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
'identifier': [ 1, 2, 3, 2, 3, 4, 4],
'col_3': [ 10, 20, 30, 21, 31, 40, 41],
'col_4': [ 1, 1, 1, 1, 1, 1, 1]
})
In above df, it would be
Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.
IIUC, you can agregate the columns as sets and keep those with more than one element:
s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]
output:
identifier
2 col_3 {20, 21}
3 col_3 {30, 31}
4 col_1 {D, E}
col_3 {40, 41}
dtype: object
further aggregation:
out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)
output:
identifier
2 [col_3]
3 [col_3]
4 [col_1, col_3]
Name: level_1, dtype: object