I have two csv files with same column names like this: File1: (df1)
column1 column2 column3 column4
ABC 100 020 030
DEF 200 040 050
GHI 300 001 002
File2: (df2)
column1 column2 column3 column4
ABC 100 060 070
DEF 200 040 090
I am writing a comparison script to generate the file like this:
column1 column2 column3 column4
ABC 100 020 | 060 030 | 070
DEF 200 . 050 | 090
GHI 300 001 002
Comparing the combination of colummn1 and column2 to highlight the differences in the other columns and if "Column1" and "Column2" not matched then return the same value of the row (In the case of "GHI").
My code looks like this:
df = pd.concat([df1,df2], sort=False)
df.set_index(['column1', 'column2'], inplace=True)
df = df.replace(np.nan, '', regex=True)
def report_diff(x):
print(x)
return '.' if x[0] == x[1] else '{} | {}'.format(*x)
changes = df.groupby(level=['column1', 'column2']).agg(report_diff)
display(changes)
It gives me an error of "index 1 is out of bounds for axis 0 with size 1". The error is happening for "GHI" as x[1]
is not found in the df2. How this can be resolved?
Looking forward.
Try:
def format_cell(a, b):
if pd.isna(a) and pd.isna(b):
return "."
if pd.isna(a):
return b
if pd.isna(b):
return a
if a == b:
return "."
return f"{a} | {b}"
x = df1.merge(df2, on=["column1", "column2"], how="outer")
for c in df1.columns.difference(["column1", "column2"]):
x[c] = [format_cell(a, b) for a, b in zip(x[f"{c}_x"], x[f"{c}_y"])]
x = x.drop(f"{c}_x", axis=1)
x = x.drop(f"{c}_y", axis=1)
print(x)
Prints:
column1 column2 column3 column4
0 ABC 100 20 | 60.0 30 | 70.0
1 DEF 200 . 50 | 90.0
2 GHI 300 1 2