Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...
If the full text's string representation is too large to be held in memory at once, is it going to be faster to:
or
In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?
I'm expecting a couple of "it depends" answers, so if you want some assumtions to work with:
Option A is only useful if you reuse the hash (i.e. have other files to compare) so that the cost of calculating the hash isn't a factor...
Otherwise Option B is what i would go for...
To get the maximum speed I would use MemoryMappedFile
instances and XOR the content - the comparison can stop at the first encounter of a difference (i.e. the XOR operation returns something != 0). Regarding memory consumption you can use a "moving window" (i.e. via the call to CreateViewAccessor
) which would allow for literally processing files of TB-size...
It could even be worth to test performance of XOR against some LINQ based comparison methods... and always start by comparing the file sizes, this way you avoid doing unnecessary calculations...