Search code examples
hashhash-functionmurmurhash

How to generate hash of arbitrary length with MurmurHash3 32 bit


I am currently trying to hash a set of strings using MurmurHash3, since 32 bit hash seems to be too large for me to handle. I wanted to reduce the number of bit used to generate hashes to around 24 bits. I already found some questions explaining how to reduce it to 16, 8, 4, 2 bit using XOR folding, but these are too few bits for my application.

Can somebody help me?


Solution

  • When you have a 32-bit hash, it's something like (with spaces for readability):

    1101 0101  0101 0010  1010 0101  1110 1000
    

    To get a 24-bit hash, you want to keep the lower order 24 bits. The notation for that will vary by language, but many languages use "x & 0xFFF" for a bit-wise AND operation with 0xFFF hex. That effectively does (with the AND logic applied to each vertical column of numbers, so 1 AND 1 is 1, and 0 and 1 is 0):

    1101 0101  0101 0010  1010 0101  1110 1000 AND  <-- hash value from above
    0000 0000  1111 1111  1111 1111  1111 1111      <-- 0xFFF in binary
    ==========================================
    0000 0000  0101 0010  1010 0101  1110 1000
    

    You do waste a little randomness from your hash value though, which doesn't matter so much with a pretty decent hash like murmur32, but you can expect slightly reduced collisions if you instead further randomise the low-order bits using the high order bits you'd otherwise chop off. To do that, right-shift the high order bits and XOR them with lower-order bits (it doesn't really matter which). Again, a common notation for that is:

     ((x & 0xF000) >> 8) ^ x
    

    ...which can be read as: do a bitwise-AND to retrain only the most significant byte of x, then shift that right by 8 bits, then bitwise excluse-OR that with the original value of X. The result of the above expression then has bit 23 (counting from 0 as the least signficant bit) set if and only if one or other (but not both) of bits 23 and 31 were set in the value of x. Similarly, bit 22 is the XOR of bits 22 and 30. So it goes down to bit 16 which is the XOR of bit 16 and bit 24. Bits 0..15 remain the same as in the original value of x.

    Yet another approach is to pick a prime number ever-so-slightly lower than 2^24-1, and mod (%) your 32-bit murmur hash value by that, which will mix in the high order bits even more effectively than the XOR above, but you'll obviously only get values up to the prime number - 1, and not all the way to 2^24-1.