How to compare (diff) two large CSV files where order does not matter


I'm struggling with comparing (diff'ing) 2 large CSV files.

  • The order of rows doesn't matter
  • I don't need to print the differences or anything, just a True or False.

For example:

File 1

a,b,c,d
e,f,g,h
i,j,k,l

File 2

a,b,c,d
i,j,k,l
e,f,g,h

The above should pass the comparison: even though the rows are not in the same order, the contents are identical.

The comparison should fail if the contents differ: column values don't match, a line exists in one file but not the other, and so on.

The biggest issue I have is that the files are very large, and there is no key column to sort on. The files have 14 to 30 million rows and about 10 to 15 columns; a raw, unsorted dump comes to around a 1 GB CSV file.

Right now I'm sorting both files and then "diff"-ing them using the code below. The problem is that the sort does not always work: with smaller files and fewer rows, the sort-and-diff approach works, but it doesn't seem to with very large files.

Also, sorting adds significant runtime; ideally I would like to avoid sorting altogether and just compare the files ignoring row order, but I don't know how to do that.

filecmp, difflib and some other modules I tried all require pre-sorted files.

I'm using a Python merge sort right now, but, as I said, the sort doesn't reliably work on a very large number of rows, and I'm hoping for a better way to compare.

This is the Python merge sort function:

import os
from heapq import merge
from itertools import cycle, islice
from tempfile import gettempdir

def batch_sort(input_path, output_path, key=None, buffer_size=32000, tempdirs=None):
    # Accept either a comma-separated string or a list of temp directories;
    # fall back to the system temp dir.
    if isinstance(tempdirs, str):
        tempdirs = tempdirs.split(",")
    if not tempdirs:
        tempdirs = [gettempdir()]

    chunks = []
    try:
        # Read the input in buffer_size-line slices, sort each slice in
        # memory, and spill it to a temp file (cycling over tempdirs).
        with open(input_path, 'rb', 64 * 1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(
                    os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64 * 1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)  # rewind so the merge can re-read the chunk
        # k-way merge of the sorted chunk files into the final output.
        with open(output_path, 'wb', 64 * 1024) as output_file:
            output_file.writelines(merge(*chunks, key=key))
    finally:
        # Close and delete all temporary chunk files.
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass

I can call batch_sort(), giving it an input file, an output file, the chunk size, and the temporary directories to use.

Once I perform batch_sort() on both files, I can just "diff file1 file2".

The above works with 25,000 to 75,000 rows, but not when we're talking in excess of 14 million rows.
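
For reference, this is roughly how I drive it; the ".sorted" output names are just placeholders, and I use filecmp.cmp here in place of an external diff:

import filecmp

def files_match(file1, file2, tempdir):
    # Sort both files, then compare the sorted outputs byte-for-byte.
    batch_sort(file1, file1 + '.sorted', tempdirs=tempdir)
    batch_sort(file2, file2 + '.sorted', tempdirs=tempdir)
    return filecmp.cmp(file1 + '.sorted', file2 + '.sorted', shallow=False)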


Solution

  • Just use a set and add each row. Compare the sets at the end:

    def compare(file1, file2):
        # Load every line of both files into sets; row order is ignored,
        # only membership matters.
        with open(file1) as fh1, open(file2) as fh2:
            left = {line for line in fh1}
            right = {line for line in fh2}

        return left == right
    

    If you're really concerned about memory, you can hold only the first file in a set and stream the second; the moment a line from the second file isn't found in the first, you can short-circuit. The final left == right check still catches lines that appear only in the first file:

    def compare(file1, file2):
        # Only the first file is held in memory.
        with open(file1) as fh:
            left = {line for line in fh}

        right = set()

        with open(file2) as fh:
            for line in fh:
                # Short-circuit as soon as file2 has a line file1 lacks.
                if line not in left:
                    return False

                right.add(line)

        # Catches lines that appear only in file1.
        return left == right
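
    Note that both versions treat each file as a set of rows, which collapses duplicates: two files that differ only in how many times a given row repeats would still compare equal. If row multiplicity matters, collections.Counter gives a multiset comparison instead. A minimal sketch (the function name is mine):

    from collections import Counter

    def compare_multiset(file1, file2):
        # Counter keeps row counts, so files that differ only in the
        # number of repeats of a row compare unequal.
        with open(file1) as fh1, open(file2) as fh2:
            return Counter(fh1) == Counter(fh2)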
    

    Edit

    Since you don't care about showing the differences, only whether the files are the same, you can apply a numeric hash to each row and compare the per-file sums:

    def are_equivalent(a, b):
        # Sum a numeric hash of each row; the sum is order-independent,
        # so equal collections of rows give equal sums.
        with open(a) as fh, open(b) as gh:
            x = sum(hash(line) for line in fh)
            y = sum(hash(line) for line in gh)

        return x == y
    

    This way you aren't paying to store the lines in any data structure.
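
    One caveat: summing the built-in hash() values is a probabilistic check, so two different files could, in rare cases, collide and compare equal. If you want a stronger guarantee at the same constant memory cost, the same idea works with a cryptographic digest per row. A minimal sketch, assuming the rows can be read as raw bytes (the helper names are mine):

    import hashlib

    def are_equivalent_strict(a, b):
        def row_sum(path):
            # Sum a 256-bit digest of each row; the sum is still
            # order-independent but far less collision-prone than hash().
            total = 0
            with open(path, 'rb') as fh:
                for line in fh:
                    total += int.from_bytes(hashlib.sha256(line).digest(), 'big')
            return total

        return row_sum(a) == row_sum(b)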