Search code examples
hashstring-hashing

Checking for string matches using hashes, without double-checking the entire string


I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?

I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)

To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.

Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?

Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.


Solution

  • For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash?

    No, and there can't be. Think about it: The hash has a finite length, but the strings do not. Say for argument's sake that the hash is 32-bits. Can you create more than 2 billion unique strings with the same length? Of course you can - you can create an infinite number of unique strings, so comparing the hashes is not enough to guarantee uniqueness. This argument scales to longer hashes.

    If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?

    Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.

    Some of the algorithms used for cyclic redundancy checks have guarantees like if there's exactly one bit different then the CRC is guaranteed to be different over a certain run length of bits, but that only works for relatively short runs.