I was learning about hash tables last week, and I'm wondering what the best values are to pick for the hash base and the table size in my hash function so that its operations have good time complexity.
Here's the code for my hash function:
def my_hash(key, hashBase, tableCapacity):
    # polynomial rolling hash, reduced mod tableCapacity at every step
    h = 0
    for i in range(len(key)):
        h = (h * hashBase + ord(key[i])) % tableCapacity
    return h
Why does picking hashBase = 1 increase the time complexity of the hash table's operations? Why is it better to pick a large tableCapacity? Also, why does e.g. hashBase = 250726 and tableCapacity = 250727 cause its operations to slow down?
tableCapacity should generally be kept in a sane ratio to the number of keys that will be hashed into the table. Exactly what ratio depends on how hash collisions will be handled - namely either:
- alternative buckets will be found ("open addressing" aka "closed hashing"): with a good hash function, 20-50% more buckets than keys is a generally sane range
- each bucket contains some chain of elements that hashed there ("separate chaining"): with a good hash function it doesn't matter as much, so you could have half as many buckets as keys, or twice as many, and things will rock along without any great drama
That said, when the hash function isn't good, and the keys being hashed aren't random enough to help the hash function perform adequately, it helps to have a tableCapacity that reduces collisions: try any prime number around the value derived from the number-of-keys-being-hashed and the ratios listed above. For example, if you have 6 keys and are using separate chaining, a tableCapacity of 5, 7 or 11 would be sane.
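If it helps to make that concrete, here's a rough sketch of how you might size the table - the next_prime / pick_capacity helpers and the 30% figure below are just illustrative, not something your code has to use:

def next_prime(n):
    # smallest prime >= n; trial division is fine for table-sized numbers
    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n

def pick_capacity(num_keys, extra=0.3):
    # e.g. ~30% more buckets than keys for open addressing;
    # for separate chaining anything near num_keys is fine
    return next_prime(int(num_keys * (1 + extra)))

print(pick_capacity(6))   # -> 7, matching the example above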
But, your question doesn't say how collisions will be handled, so we'll leave that with you.
Let's move on to considering the hashing logic itself:
h = (h * hashBase + ord(key[i])) % tableCapacity
This is like a simplified / compromised form of the "MAD" hash approach described in this question - there's an explanation in my answer there, which I'll assume you've read from here on.
If we contrast your function with the general MAD form, we see you're using % tableCapacity on every slice (byte?) of the key. The reason that might make sense in Python is that Python doesn't have fixed-number-of-bit integers that overflow the way many lower-level languages (and the CPU itself) do, so if you don't have some % operation inside the loop, the h value may grow to a similar size as the entire key - if you're generating a hash of a video file as a cheap checksum, that'd be very slow and wasteful of memory.

So, using % to constrain how large h can get after every iteration is sane, but for the reasons explained in the other answer it's especially important that tableCapacity is prime, and hashBase should be chosen to typically produce values much larger than tableCapacity, to minimise the amount by which earlier hash buckets are more heavily utilised than later ones (see the 200/255 example in my other answer linked above).
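To convince yourself that the % inside the loop only keeps h small and doesn't change which bucket a key lands in, here's a quick check (the function names and the particular base/capacity are made up for this illustration):

def hash_mod_each_step(key, hashBase, tableCapacity):
    # reduce after every character, so h never exceeds tableCapacity
    h = 0
    for ch in key:
        h = (h * hashBase + ord(ch)) % tableCapacity
    return h

def hash_mod_at_end(key, hashBase, tableCapacity):
    # let h grow as large as it likes, reduce only once at the end
    h = 0
    for ch in key:
        h = h * hashBase + ord(ch)
    return h % tableCapacity

key = "a reasonably long key to hash"
assert hash_mod_each_step(key, 2654435761, 786433) == hash_mod_at_end(key, 2654435761, 786433)

Both agree on the bucket; the version with the inner % is simply the one whose intermediate h stays small, which is what you want for very long keys.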
Summarily: pick a big pseudo-random hashBase - say a 32 or even 64 bit random number - and a prime tableCapacity in a sane ratio to your number of keys, given the open- or closed-hashing design you've chosen.
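In code, that boils down to something like the following (the particular prime and key count are only for illustration):

import random

hashBase = random.getrandbits(64)   # big pseudo-random multiplier
tableCapacity = 786433              # a prime, sized for (say) roughly 600,000 keys

# then use them exactly as in your function:
# h = (h * hashBase + ord(key[i])) % tableCapacity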
Why does picking hashBase = 1 increase the time complexity of the hash table's operations?
hashBase shouldn't be small - it means the contribution of key[i] isn't likely to wrap h around the table many times before the % operation is applied again, so you lose the benefit of scattering the mapping around the table. In the extreme case of hashBase = 1, the hash degenerates into a plain sum of the character codes, so any two keys containing the same characters in a different order collide.
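You can see that degeneration directly - with hashBase = 1, permutations of the same characters always land in the same bucket (the key set and capacity below are arbitrary):

def bucket(key, hashBase, tableCapacity):
    h = 0
    for ch in key:
        h = (h * hashBase + ord(ch)) % tableCapacity
    return h

keys = ["abc", "acb", "bac", "bca", "cab", "cba"]

print(len({bucket(k, 1, 1009) for k in keys}))           # 1 - every permutation collides
print(len({bucket(k, 2654435761, 1009) for k in keys}))  # 6 - a big base separates them all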
Why is it better to pick a large tableCapacity?
Well, larger tables mean more buckets - with the same number of keys there will tend to be fewer collisions, but with decent hashing you don't need to go overboard. More buckets means more memory used and fewer cache hits, which slows things down.
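A rough experiment shows the trend (random keys, arbitrary sizes, and the bucket helper repeated from above so this snippet runs on its own):

import random, string

def bucket(key, hashBase, tableCapacity):
    h = 0
    for ch in key:
        h = (h * hashBase + ord(ch)) % tableCapacity
    return h

random.seed(0)
keys = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(1000)]

for capacity in (1009, 2003, 10007):   # all primes
    used = len({bucket(k, 2654435761, capacity) for k in keys})
    print(capacity, 1000 - used, "colliding keys")   # count drops as capacity grows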
Also, why does e.g. hashBase = 250726 and tableCapacity = 250727 cause its operations to slow down?
As explained above, you want hashBase to be much larger than tableCapacity. Worse, 250726 is exactly tableCapacity - 1, i.e. it's congruent to -1 modulo the capacity, so the hash collapses into an alternating sum of the character codes and lots of unrelated keys pile into a small band of buckets.
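The same kind of experiment as above makes the damage obvious (keys are arbitrary, and bucket is repeated so the snippet is standalone):

import random, string

def bucket(key, hashBase, tableCapacity):
    h = 0
    for ch in key:
        h = (h * hashBase + ord(ch)) % tableCapacity
    return h

random.seed(1)
keys = {"".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(10000)}

# 250726 is -1 mod 250727, so only the alternating sum of the character codes
# matters: 8-letter lowercase keys can reach at most a couple of hundred of the
# 250727 buckets, so nearly every key collides
print(len({bucket(k, 250726, 250727) for k in keys}))

# a big pseudo-random base spreads the same keys over nearly as many buckets as there are keys
print(len({bucket(k, random.getrandbits(64), 250727) for k in keys}))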