
Faster way to compute difference between two csv files


I'm trying to compute the difference between two large CSV files (~4 GB) to obtain the newly added rows and write them to an output CSV file. I'm able to get this working for relatively small files (~50 MB) with the following code.

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(output_path, 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

However, for larger files the above code either runs far too slowly (around half an hour) or crashes because it runs out of memory.

Is there a faster way to compute the difference for large CSV files?


Solution

  • You don't have to read the second file fully; just read it line by line.

    For speed, build a set from the first file (fast lookups, and it saves memory if there are duplicate lines). To do this, keep the second file open while writing the result:

    input_file1 = "data.csv"
    input_file2 = "data_1.csv"
    output_path = "out.csv"
    
    with open(input_file1, 'r') as t1:
        fileone = set(t1)
    
    with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
        for line in t2:
            if line not in fileone:
                outFile.write(line)
    
    • for line in t2 reads the file line by line (avoid readlines() whenever you can), so even if the file is big, the memory footprint stays small.
    • fileone does take some memory, true, but hopefully not much if the first file is smaller and/or has duplicate lines, and in any case less than with readlines().
    • if line not in fileone may look the same as before, but set membership tests have average O(1) complexity, which makes the program much faster.
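    To make the O(1)-vs-O(n) point concrete, here is a small illustrative benchmark (the sizes, row format, and counts are made up for demonstration; they are not from the question's data):

    ```python
    import timeit

    # 100,000 synthetic CSV-like lines.
    lines = [f"row{i},value{i}\n" for i in range(100_000)]
    as_list = lines
    as_set = set(lines)

    # Worst case for the list: the needle is the last element,
    # so "in" must scan all 100,000 entries.
    needle = "row99999,value99999\n"

    list_time = timeit.timeit(lambda: needle in as_list, number=100)
    set_time = timeit.timeit(lambda: needle in as_set, number=100)
    print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.6f}s")
    ```

    The list scan grows linearly with file size, while the set lookup is a hash probe, which is why the rewrite scales to 4 GB inputs.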

    If one of the files doesn't end with a newline, the comparison can fail for the last line (one side keeps its trailing \n while the other doesn't). This case can be handled, with a slight performance overhead:

    • by calling rstrip() on each line before inserting it into the set or testing membership (though this also removes trailing spaces)
    • by special-casing the last line
    • by seeking to the end of each file and checking the last character, aborting if it isn't a linefeed.
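    A minimal sketch of the first option, using rstrip("\r\n") so that only line terminators are stripped and trailing spaces are preserved (the tiny sample files here are invented for the demonstration; note that "data.csv" deliberately lacks a final newline):

    ```python
    # Create small example inputs; "data.csv" has no trailing newline,
    # so its last line would not match "b,2\n" without normalization.
    with open("data.csv", "w") as f:
        f.write("a,1\nb,2")
    with open("data_1.csv", "w") as f:
        f.write("a,1\nb,2\nc,3\n")

    # Strip only the line terminator before storing or testing,
    # so a missing final newline in either file doesn't break matching.
    with open("data.csv") as t1:
        fileone = {line.rstrip("\r\n") for line in t1}

    with open("data_1.csv") as t2, open("out.csv", "w") as out_file:
        for line in t2:
            if line.rstrip("\r\n") not in fileone:
                out_file.write(line)
    ```

    With these inputs, only the genuinely new row c,3 lands in out.csv; the naive version would also emit b,2, because "b,2" and "b,2\n" compare unequal.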