Search code examples
pythonpandascsvcomparison

Finding differences and concatenating the differences in pandas


I have two csv files with same column names like this: File1: (df1)

column1    column2    column3    column4
ABC        100        020        030
DEF        200        040        050
GHI        300        001        002

File2: (df2)

column1    column2    column3    column4
ABC        100        060        070
DEF        200        040        090

I am writing a comparison script to generate the file like this:

column1    column2    column3         column4
ABC        100        020 | 060       030 | 070
DEF        200        .               050 | 090
GHI        300        001             002

Comparing the combination of colummn1 and column2 to highlight the differences in the other columns and if "Column1" and "Column2" not matched then return the same value of the row (In the case of "GHI").

My code looks like this:

df = pd.concat([df1,df2], sort=False)
df.set_index(['column1', 'column2'], inplace=True)
df = df.replace(np.nan, '', regex=True)

def report_diff(x):
    print(x)
    return '.' if x[0] == x[1] else '{} | {}'.format(*x)


changes = df.groupby(level=['column1', 'column2']).agg(report_diff)
display(changes)

It gives me an error of "index 1 is out of bounds for axis 0 with size 1". The error is happening for "GHI" as x[1] is not found in the df2. How this can be resolved?

Looking forward.


Solution

  • Try:

    def format_cell(a, b):
        if pd.isna(a) and pd.isna(b):
            return "."
    
        if pd.isna(a):
            return b
    
        if pd.isna(b):
            return a
    
        if a == b:
            return "."
    
        return f"{a} | {b}"
    
    
    x = df1.merge(df2, on=["column1", "column2"], how="outer")
    
    for c in df1.columns.difference(["column1", "column2"]):
        x[c] = [format_cell(a, b) for a, b in zip(x[f"{c}_x"], x[f"{c}_y"])]
        x = x.drop(f"{c}_x", axis=1)
        x = x.drop(f"{c}_y", axis=1)
    
    print(x)
    

    Prints:

      column1  column2    column3    column4
    0     ABC      100  20 | 60.0  30 | 70.0
    1     DEF      200          .  50 | 90.0
    2     GHI      300          1          2