Adler32 and CRC have the property that f(a || b)
can be computed inexpensively from f(a)
, f(b)
, and len(b)
. Are there any other common non-cryptographic hash functions with this property?
Context (to avoid XY problem) is that I am deduplicating strings by splitting them into chunks, which are indexed by their hash. An input string can then be represented as a sequence of chunks, concatenated. I'd like to use a hash function such that all representations of a string have the same hash, which can be computed directly from the chunk hashes without needing the underlying data, as it is being streamed in unspecified order and thus may not be available in the same place at any one time.
My design calls for roughly 2^32 chunks. Collisions are very expensive, but would not harm correctness. Based on that, I think that CRC64 would work, but I'm curious what my alternatives are. I wouldn't mind a 128 bit hash for future proofing (as in: dataset size may grow).
The probability of one collision among all pairs of your 232 64-bit CRCs is about 1/2. If that's too high for you, you can use a 128-bit CRC. That drops the probability of one collision to 3x10-20.