For a university project I have been assigned the task of creating data structures (such as a min-heap, a hash table, etc.) from scratch. However, the hash table, or more specifically the hash functions, have given me quite some trouble. I have come across the MAD (Multiply, Add, Divide) method, which is basically: h(x) = [(a*x + b) % p] % N, where a, b are random integers, p is a large prime number, and N is the number of buckets in the hash table.
My question is: how (and why) exactly does this function distribute the values evenly across the hash table?
h(x) = [(a*x + b) % p] % N
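(A note on the code sketches below: the question doesn't name a language, so C++ is assumed throughout, and names like `mad_hash` are purely illustrative.) To start, the formula transcribes almost literally into a function - the 64-bit intermediate is only there so `a*x + b` can't overflow for 32-bit-sized `a`, `b` and `p` before the two modulo steps:

```cpp
#include <cstddef>
#include <cstdint>

// h(x) = [(a*x + b) % p] % N  -- a, b random, p a large prime, N the bucket count
std::size_t mad_hash(std::uint32_t x,
                     std::uint64_t a, std::uint64_t b,
                     std::uint64_t p, std::size_t N) {
    return static_cast<std::size_t>(((a * x + b) % p) % N);
}
```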
Let's look at `a*x + b` in isolation first. If you imagine `a` broken down into a sum of powers of two, `a*x` is then the sum of `x` bit-shifted left by a smattering of powers of two, such that each bit in `x` impacts the bit positions that are set in `a`, and some further bits whenever the summation produces carries at particular positions. Adding `b` mixes in another set of random bits: much like XORing would, but with some extra complexity from the carries. If say `x` is a value between 0 and 255, with bits `abcdefgh` (each being 0 or 1), then so far we've got:
```
(a&1 ? abcdefgh    : 0) +
(a&2 ? abcdefgh0   : 0) +
(a&4 ? abcdefgh00  : 0) +
(a&8 ? abcdefgh000 : 0) +
... +               // continues for a&16, a&32 etc.
ABCDEFGHIJKLMNOP    // however many random bits are in "b"
```
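If it helps to see that summation concretely, here's a small self-contained check (an illustration of my own, with arbitrary constants, not something from the MAD definition) that rebuilds `a*x + b` out of shifted copies of `x`:

```cpp
#include <cassert>
#include <cstdint>

int main() {
    std::uint32_t a = 37, b = 0xB5E3, x = 0xAB;    // arbitrary example values

    // Add a copy of x shifted left once for every bit that is set in a...
    std::uint32_t sum = 0;
    for (int bit = 0; bit < 32; ++bit)
        if (a & (1u << bit))
            sum += x << bit;

    // ...then mix in b: this is exactly the same value as a*x + b.
    assert(sum + b == a * x + b);
    return 0;
}
```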
So, in the "1s" column we're summing h
and P
, which might carry into the "2s" column with g
, h
and O
, and on it goes.
If `a` is say 37, which is 32+4+1, then we're adding `x` itself, `x << 2`, and `x << 5`: each bit in `x` thereby impacts more bits in the hash value (this is good; indeed, with a cryptographic-strength hash function, changing any bits in the key - whether a single bit, half of them, or all of them - should pretty much randomly flip about half the bits in the hash value).
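To watch that spreading happen, a quick experiment (again my own sketch, with arbitrary constants) flips one bit of `x` at a time and counts how many bits of `a*x + b` change:

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t a = 37, b = 0x9E3779B9, x = 0xAB;   // example values only

    for (int bit = 0; bit < 8; ++bit) {
        std::uint32_t y = x ^ (1u << bit);            // flip one input bit
        std::uint32_t diff = (a * x + b) ^ (a * y + b);
        std::cout << "flipping input bit " << bit << " changes "
                  << std::bitset<32>(diff).count() << " bits of a*x + b\n";
    }
}
```

With a small `a` like 37 you'll typically see a few output bits change per flipped input bit; larger, more random `a` values spread the effect further up the word.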
Returning to the full formula, let's imagine we skipped the `% p` and had just `% N`, but the current table size is a power of two: `% N` is then equivalent to a bitwise AND keeping just some number of the less-significant bits. Put another way, it's throwing away a lot of the randomness we've built up in the more significant bits of our `a * x + b` calculation. So, to make the hash function safe to use with any number of buckets, we can introduce `% p` first, which means that if there are patterns in the hash value related to power-of-two positions from the summation step, they're effectively scattered across random positions in the 0..p-1 range.
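Here's a tiny demonstration of that power-of-two point (my own illustration): when `N` is a power of two, `% N` keeps only the low bits, exactly as a bitwise AND would, so every higher bit of the hash is discarded:

```cpp
#include <cassert>
#include <cstdint>

int main() {
    const std::uint32_t N = 1024;              // a power-of-two bucket count
    for (std::uint32_t h = 0; h < 1000000; ++h)
        assert(h % N == (h & (N - 1)));        // only the low 10 bits survive
}
```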
Consider say a hash value between 0 and 255 - if `N` was 200, we'd be twice as likely to hash to a bucket in the 0..55 range. To make this effect less significant, we want the hash value to have many more bits than the MOD value, and this principle applies in a layered way to the values we should choose for `p` and `N`:
- `a * x + b` values should tend to be significantly larger than `p`, and be spread across a range much larger than `p`, so `% p` spreads them more evenly across the buckets, but
- `p` should be much larger than `N`, so we don't have low-indexed buckets with significantly higher collision probabilities (which is especially bad if you're using linear probing to resolve collisions).
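To make the earlier 0..255-into-200-buckets example concrete, here's a short count (my own sketch) showing the uneven traffic: buckets 0..55 each receive two of the 256 possible hash values, while the rest receive only one.

```cpp
#include <iostream>
#include <vector>

int main() {
    const int N = 200;                     // bucket count
    std::vector<int> hits(N, 0);
    for (int h = 0; h < 256; ++h)          // every possible 8-bit hash value
        ++hits[h % N];

    int doubled = 0;
    for (int b = 0; b < N; ++b)
        if (hits[b] == 2) ++doubled;
    std::cout << doubled << " buckets (0.." << doubled - 1
              << ") get twice the traffic of the remaining " << N - doubled << "\n";
}
```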
For example, if we wanted to support values of `N` up to 2^24, and we're doing these calculations with 32-bit unsigned integers so `a` and `b` have random values in that range, we might split the difference and pick a prime of around 2^28.
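Putting it all together, here's a minimal sketch of a MAD hash along those lines (my own code and naming, not a definitive implementation; the prime is found at runtime with a simple trial-division search rather than hard-coded, and 64-bit arithmetic keeps `a*x + b` from wrapping before the modulos are applied):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Trial division is plenty fast for a one-off search near 2^28.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

class MadHash {
public:
    explicit MadHash(std::size_t bucket_count)      // N, intended to go up to ~2^24
        : N_(bucket_count) {
        p_ = 1ull << 28;                            // first prime at or above 2^28
        while (!is_prime(p_)) ++p_;

        std::mt19937_64 gen(std::random_device{}());
        std::uniform_int_distribution<std::uint64_t> dist(1, 0xFFFFFFFFull);
        do { a_ = dist(gen); } while (a_ % p_ == 0);  // random 32-bit a, not a multiple of p
        b_ = dist(gen);                               // random 32-bit b
    }

    std::size_t bucket_for(std::uint32_t x) const {
        // a, b and x all fit in 32 bits, so a*x + b fits in 64 bits with no wrap-around.
        return static_cast<std::size_t>(((a_ * x + b_) % p_) % N_);
    }

private:
    std::size_t N_;
    std::uint64_t p_, a_, b_;
};
```

Usage is then just `MadHash h(200000); std::size_t bucket = h.bucket_for(key);`.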