My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F
, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ... Fₙ]
such that fⱼ, fₖ ∈ EQV[i] iff (fⱼ is identical to fₖ) for all i, j, k
.
My general approach is to start with just one big class containing all initial files, EQV₀ = [[f₁, f₂, ..., fₙ]]
, and repeatedly split it into more refined classes EQV₁
, EQV₂
... EQVₘ₋₁
, based on some m
heuristics, for example, file size, checksum function 1, checksum 2. After all m
heuristics have been applied (EQVₘ₋₁
), a pairwise diff of all the files within each class in EQVₘ₋₁
must be made. Because this last step is quadratic for each of the classes in EQVₘ₋₁
, ie
O(sum(n² for n in map(len, EQVₘ₋₁)) )
and will probably be the bottleneck of my algorithm if each the m splits are done in linear time, my goal is to make EQVₘ₋₁
as flat as possible.
I would like to have access to a variety of good hash functions that I can apply to minimize collisions on EQVₘ₋₁
. My current idea is to use some library provided checksum function, such as adler, and to generate variations of it by simply applying it to different starting bytes within the file. Another one is to first apply fast hash functions, such as adler, and then more expensive ones such as md5 on only the classes which are still too large.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python that aren't cryptographically secure?
Edit: Another idea seems to use "rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose? http://en.wikipedia.org/wiki/Rabin_fingerprint
I would recommend first using adler32
, then crc32
. There may be many very short files that have the same adler32
's, but different crc32
's. In fact, you could consider a single step using crc32
on files below a certain size on the first pass. That size could be about 1K. For longer files, adler32
and crc32
will have close to the same collision probability.
Depending on how many files you have, you could consider a subsequent step with larger hashes, such as md5
or sha1
. See this answer for the probability and count of expected collisions for a 32-bit check value. Roughly, if you have millions of files, that step may be worth doing.
You will get no benefit by going to even longer hash values. The 128 bits from md5 is plenty to distinguish all of the files on every computer in the world.