Tags: python, csv, difflib

Compare 2 large CSVs using python - output the differences


I am writing a program to compare all files and directories between two filepaths (basically the files' metadata, content, and internal directories should match)

File content comparison is done row by row. The dimensions of the CSVs may or may not be the same, but the approaches below generally handle the scenario where the dimensions differ.

The problem is that processing time is too slow.

Some context:

  1. The two files are identified to be different using filecmp (a sketch of that check follows this list)
  2. This particular problematic CSV is ~11k columns by 800 rows
  3. My program will not know the data types inside the CSV beforehand, so defining the dtype for pandas is not an option
  4. Difflib does an excellent job if the CSV file is small, but not for this particular use case
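
For reference, the filecmp check from point 1 is roughly the sketch below; the placeholder paths and the shallow=False flag (which forces a content comparison instead of an os.stat()-only check) are just for illustration:

import filecmp

f1 = "path/to/source/data.csv"   # placeholder paths
f2 = "path/to/target/data.csv"

# shallow=False compares the file contents, not just the os.stat() signatures
if not filecmp.cmp(f1, f2, shallow=False):
    print("files differ - compare the contents row by row")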

I've looked at all the related questions on SO and tried these approaches, but the processing time was terrible, and Approach 3 gives weird results.

Approach 1 (Pandas) - painfully slow, and I keep getting this error:

UserWarning: You are merging on int and float columns where the float values are not equal to their int representation.

import pandas as pd
import numpy as np

df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)
diff = df1.merge(df2, how='outer', indicator='exists').query("exists!='both'")
print(diff)
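
For what it's worth, the warning seems to come from pandas inferring int for a column in one file and float for the same column in the other. Forcing every column to text with dtype=str would make the merge compare the raw cell values instead; this is only a sketch I have not timed on the full file:

import pandas as pd

# read every cell as text so both frames get identical dtypes,
# which avoids the int-vs-float merge warning (untimed sketch)
df1 = pd.read_csv(f1, dtype=str, keep_default_na=False)
df2 = pd.read_csv(f2, dtype=str, keep_default_na=False)

diff = df1.merge(df2, how='outer', indicator='exists').query("exists != 'both'")
print(diff)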

Approach 2 (Difflib) - painfully slow for this huge CSV

import difflib

def CompareUsingDiffLib(file1_lines, file2_lines):
    # difflib.HtmlDiff renders the diff as an HTML table
    h = difflib.HtmlDiff()
    html = h.make_file(file1_lines, file2_lines, context=True, numlines=0)
    htmlfilepath = filePath + "\\htmlFiles"
    with open(htmlfilepath, 'w') as fh:
        fh.write(html)

with open(file1) as f, open(file2) as z:
    f1 = f.readlines()
    f2 = z.readlines()
    CompareUsingDiffLib(f1, f2)

Approach 3 (Pure python) - Incorrect results

with open (f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()
# check row number of diff in file 1
    for line in file1:
        if line not in file2:
            print(file1.index(line))

# it reports every row from row number 278 to the last row
# as not being in file 2, which is incorrect
# I checked using difflib, and using Excel as well
# no idea why the results are like that

# running the code below shows the same result as the first block
    for line in file2:
        if line not in file1:
            print(file2.index(line))

Approach 4 (csv-diff) - painfully slow

from csv_diff import load_csv, compare

diff = compare(
    load_csv(open("one.csv")),
    load_csv(open("two.csv"))
)

Can anybody please help on either:

  1. An approach with less processing time
  2. Debugging Approach 3

Solution

  • Comparing the files with readlines() and just testing for membership ("is this in that?") is not the same as diffing the lines.

    with open (f1) as f, open(f2) as z:
        file1 = f.readlines()
        file2 = z.readlines()
    
        for line in file1:
            if line not in file2:
                print(file1.index(line))
    

    Consider these two CSVs:

    file1.csv     file2.csv
    -----------   -----------
    a,b,c,d       a,b,c,d
    1,2,3,4       1,2,3,4
    A,B,C,D       i,ii,iii,iv 
    i,ii,iii,iv   A,B,C,D
    

    That script will produce no output (and give the false impression there is no diff) because every line of file 1 appears somewhere in file 2, even though the files differ line for line. (I cannot say why you were getting the results you describe, though, without seeing the files.)
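
    To make that concrete, here is the membership test run against those two four-line files, with the rows inlined as lists for the demonstration:

    # the same membership test, with the example rows inlined
    file1 = ["a,b,c,d\n", "1,2,3,4\n", "A,B,C,D\n", "i,ii,iii,iv\n"]
    file2 = ["a,b,c,d\n", "1,2,3,4\n", "i,ii,iii,iv\n", "A,B,C,D\n"]

    for line in file1:
        if line not in file2:
            print(file1.index(line))

    # prints nothing: every line of file1 occurs somewhere in file2,
    # even though rows 3 and 4 are swapped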

    I recommend using the CSV module and iterating the files row by row, and then even column by column:

    import csv
    
    path1 = "file1.csv"
    path2 = "file2.csv"
    
    with open(path1) as f1, open(path2) as f2:
        reader1 = csv.reader(f1)
        reader2 = csv.reader(f2)
    
        for i, row1 in enumerate(reader1):
            try:
                row2 = next(reader2)
            except StopIteration:
                print(f"Row {i+1}, f1 has this extra row compared to f2")
                continue
    
            if row1 == row2:
                continue
    
            if len(row1) != len(row2):
                print(f"Row {i+1} of f1 has {len(row1)} cols, f2 has {len(row2)} cols")
                continue
    
            for j, cell1 in enumerate(row1):
                cell2 = row2[j]
                if cell1 != cell2:
                    print(f'Row {i+1}, Col {j+1} of f1 is "{cell1}", f2 is "{cell2}"')
    
        for row2 in reader2:
            i += 1
            print(f"Row {i+1}, f2 has this extra row compared to f1")
    

    This uses the iterator over file1 to drive the iterator over file2. It accounts for any difference in row counts between the two files: a StopIteration exception simply means file1 has more rows than file2, and any rows left to read in file2 (reader2) at the very bottom are reported as extra.
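
    If you prefer not to mix enumerate() with an explicit next(), the same row pairing can be written with itertools.zip_longest. This is just an equivalent sketch of the pairing idea, collapsed to a per-row report rather than the per-cell one above:

    import csv
    from itertools import zip_longest

    path1 = "file1.csv"
    path2 = "file2.csv"

    with open(path1) as f1, open(path2) as f2:
        # zip_longest pads the shorter file with None so extra rows still show up
        for i, (row1, row2) in enumerate(zip_longest(csv.reader(f1), csv.reader(f2))):
            if row1 is None:
                print(f"Row {i+1}, f2 has this extra row compared to f1")
            elif row2 is None:
                print(f"Row {i+1}, f1 has this extra row compared to f2")
            elif row1 != row2:
                print(f"Row {i+1} differs")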

    When I run that against these files:

    file1         file2
    -----------   ----------
    a,b,c,d       a,b,c      
    1,2,3,4       1,2,3,4    
    A,B,C,D       A,B,C,Z    
    i,ii,iii,iv   i,ii,iii,iv
                  x,xo,xox,xoxo
    

    I get:

    Row 1 of f1 has 4 cols, f2 has 3 cols
    Row 3, Col 4 of f1 is "D", f2 is "Z"
    Row 5, f2 has this extra row compared to f1
    

    If I swap path1 and path2, I get this:

    Row 1 of f1 has 3 cols, f2 has 4 cols
    Row 3, Col 4 of f1 is "Z", f2 is "D"
    Row 5, f1 has this extra row compared to f2
    

    And it does this fast. I mocked up two 800-row by 11,000-column CSVs with very, very small differences between rows (if any), and it processed all the diffs in under a second of user time (not counting printing).
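
    If you want to reproduce that timing, something like the sketch below generates a pair of files of that shape (the big1.csv/big2.csv names and the roughly 1% mutation rate are arbitrary choices for the mock-up) and times a bare row/cell comparison over them:

    import csv
    import random
    import time

    rows, cols = 800, 11_000

    # build two nearly identical 800 x 11,000 CSVs; ~1% of rows get one changed cell
    with open("big1.csv", "w", newline="") as f1, open("big2.csv", "w", newline="") as f2:
        w1, w2 = csv.writer(f1), csv.writer(f2)
        for r in range(rows):
            row = [f"r{r}c{c}" for c in range(cols)]
            w1.writerow(row)
            if random.random() < 0.01:
                row = row[:]
                row[random.randrange(cols)] = "CHANGED"
            w2.writerow(row)

    # time a bare row/cell comparison over the pair
    start = time.perf_counter()
    count = 0
    with open("big1.csv") as f1, open("big2.csv") as f2:
        for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
            if row1 != row2:
                count += sum(a != b for a, b in zip(row1, row2))
    print(f"{count} differing cells found in {time.perf_counter() - start:.2f}s")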