Search code examples
c++algorithmhashxorbit-shift

Hash algorithm for string of characters using XOR and bit shift


I was given this algorithm to write a hash function:

BEGIN Hash (string)
UNSIGNED INTEGER key = 0;
FOR_EACH character IN string
key = ((key << 5) + key) ^ character;
END FOR_EACH
RETURN key;
END Hash

The <<operator refers to shift bits to the left. The ^ refers to the XOR operation and the character refers to the ASCII value of the character. Seems pretty straightforward.

Below is my code

unsigned int key = 0;
for (int i = 0; i < data.length(); i++) {
    key = ((key<<5) + key) ^ (int)data[i];
}
return key;

However, I keep getting ridiculous positive and negative huge numbers when i should actually get a hash value from 0 - n. n is a value set by the user beforehand. I'm not sure where things went wrong but I'm thinking it could be the XOR operation.

Any suggestions or opinions will be greatly appreciated. Thanks!


Solution

  • The output of this code is a 32-bit (or 64-bit or however wide your unsigned int is) unsigned integer. To restrict it to the range from 0 to n−1, simply reduce it modulo n, using the % operator:

    unsigned int hash = key % n;
    

    (It should be obvious that your code, as written, cannot return "a hash value from 0 - n", since n does not appear anywhere in your code.)

    In fact, there's a good reason not to reduce the hash value modulo n too soon: if you ever need to grow your hash, storing the unreduced hash codes of your strings saves you the effort of recalculating them whenever n changes.

    Finally, a few general notes on your hash function:

    • As Joachim Pileborg comments above, the explicit (int) cast is unnecessary. If you want to keep it for clarity, it really should say (unsigned int) to match the type of key, since that's what the value actually gets converted into.

    • For unsigned integer types, ((key<<5) + key) is equal to 33 * key (since shifting left by 5 bits is the same as multiplying by 25 = 32). On modern CPUs, using multiplication is almost certainly faster; on old or very low-end processors with slow multiplication, it's likely that any decent compiler will optimize multiplication by a constant into a combination of shifts and adds anyway. Thus, either way, expressing the operation as a multiplication is IMO preferable.

    • You don't want to call data.length() on every iteration of the loop. Call it once before the loop and store the result in a variable.

    • Initializing key to zero means that your hash value is not affected by any leading zero bytes in the string. The original version of your hash function, due to Dan Bernstein, uses a (more or less random) initial value of 5381 instead.