Tags: c# | .net | hash | string-comparison

Comparing large text files - Is comparing hashes faster than using subsets of the file?


Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...

If the full text's string representation is too large to be held in memory at once, is it going to be faster to:

  • a) Hash both files in their entirety and then test to see if the hashes are identical

or

  • b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch

In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes? (A sketch of what I mean by option (a) follows the assumptions below.)

I'm expecting a couple of "it depends" answers, so here are some assumptions to work with:

  • Language is C# in .NET
  • Text files are 3GB each
  • Hash function is MD5
  • Maximum 'spare' RAM is 1GB
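
To make option (a) concrete, here is a minimal sketch of what I have in mind; it streams each file through MD5 so memory use stays small regardless of file size (the class and method names are just mine for illustration):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashComparison
{
    // Option (a): hash both files in full and compare the 16-byte MD5 digests.
    // ComputeHash reads each stream in small internal chunks, so the 3GB files
    // are never held in memory at once.
    public static bool HashesMatch(string pathA, string pathB)
    {
        using var md5 = MD5.Create();

        byte[] hashA, hashB;
        using (var streamA = File.OpenRead(pathA))
            hashA = md5.ComputeHash(streamA);
        using (var streamB = File.OpenRead(pathB))
            hashB = md5.ComputeHash(streamB);

        return hashA.SequenceEqual(hashB);
    }
}
```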

Solution

  • Option A is only useful if you can reuse the hash (i.e. you have other files to compare against later), so that the cost of calculating the hash isn't a factor...

    Otherwise, Option B is what I would go for...

    To get maximum speed I would use MemoryMappedFile instances and XOR the content; the comparison can stop at the first difference found (i.e. where the XOR operation returns something != 0). To keep memory consumption down, you can use a "moving window" (i.e. via calls to CreateViewAccessor), which would allow you to process files of TB size (see the sketch below)...

    It could even be worth testing the performance of XOR against some LINQ-based comparison methods... and always start by comparing the file sizes; that way you avoid doing any unnecessary work when the lengths already differ...
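
Here is a minimal sketch of that idea. The names, the 64 MB window size, and the plain byte-wise XOR loop are my own choices for illustration; a tuned version might XOR 64-bit words instead of single bytes:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedComparison
{
    // Compares two files window-by-window via memory-mapped views,
    // returning false at the first difference. The window size is arbitrary.
    public static bool AreEqual(string pathA, string pathB, int windowSize = 64 * 1024 * 1024)
    {
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length)
            return false;                                   // cheap size check first

        using var mmfA = MemoryMappedFile.CreateFromFile(
            pathA, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var mmfB = MemoryMappedFile.CreateFromFile(
            pathB, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);

        var bufferA = new byte[windowSize];
        var bufferB = new byte[windowSize];

        for (long offset = 0; offset < length; offset += windowSize)
        {
            int count = (int)Math.Min(windowSize, length - offset);

            // "Moving window": only the current slice of each file is mapped.
            using var viewA = mmfA.CreateViewAccessor(offset, count, MemoryMappedFileAccess.Read);
            using var viewB = mmfB.CreateViewAccessor(offset, count, MemoryMappedFileAccess.Read);

            viewA.ReadArray(0, bufferA, 0, count);
            viewB.ReadArray(0, bufferB, 0, count);

            for (int i = 0; i < count; i++)
                if ((bufferA[i] ^ bufferB[i]) != 0)         // XOR != 0 means a mismatch
                    return false;
        }
        return true;
    }
}
```

With two 64 MB buffers this stays well under the 1GB RAM budget, and the early size check plus early return keep the worst case bounded by a single sequential pass over both files.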