Tags: python, hashtable, hash-function

How do the hash base and table size affect the time complexity of the hash table's operations?


I was learning about hash tables last week, and I'm wondering what the best values are to pick for the hash base and the table size in my hash function so that its operations have good time complexity.

Here's the code for my hash function:

def my_hash(key, hashBase, tableCapacity):
    h = 0
    for i in range(len(key)):
        # fold each character into the running hash, keeping it below tableCapacity
        h = (h * hashBase + ord(key[i])) % tableCapacity
    return h

Why does picking hashBase = 1 increase the time complexity of the hash table's operations? Why is it better to pick a large tableCapacity? Also, why does e.g. hashBase = 250726 and tableCapacity = 250727 cause its operations to slow down?


Solution

  • tableCapacity should generally be kept in a sane ratio to the number of keys that will be hashed into the table. Exactly what ratio depends on how hash collisions will be handled - namely either:

    1. alternative buckets will be found ("open addressing" aka "closed hashing"): with a good hash function 20-50% more buckets than keys is a generally sane range

    2. each bucket contains some chain of elements that hashed there ("separate chaining"): with a good hash function it doesn't matter as much, so you could have half as many buckets as keys, or twice as many, and things will rock along without any great drama

    That said, when the hash function isn't good, and the keys being hashed aren't random enough to help the hash function perform adequately, it helps to have a tableCapacity that reduces collisions: try any prime number around the value derived from the number of keys being hashed and the ratios listed above. For example, if you have 6 keys and are using separate chaining, a tableCapacity of 5, 7 or 11 would be sane (see the sketch after this list).

    But, your question doesn't say how collisions will be handled, so we'll leave that with you.
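
    To make that concrete, here's a minimal sketch (not part of the original advice; is_prime and pick_capacity are just illustrative helper names) of deriving a prime capacity from a key count and a target load factor:

    def is_prime(n):
        """Trial-division primality check; fine for table-sized numbers."""
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        f = 3
        while f * f <= n:
            if n % f == 0:
                return False
            f += 2
        return True

    def pick_capacity(num_keys, load_factor=0.75):
        """Return the first prime >= num_keys / load_factor."""
        candidate = max(2, int(num_keys / load_factor))
        while not is_prime(candidate):
            candidate += 1
        return candidate

    print(pick_capacity(6))          # separate chaining at ~0.75 load -> 11
    print(pick_capacity(1000, 0.8))  # open addressing, ~25% more buckets than keys -> 1259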

    Let's move on to considering the hashing logic itself:

    h = (h * hashBase + ord(key[i])) % tableCapacity
    

    This is like a simplified / compromised form of the "MAD" hash approach described in this question - there's an explanation in my answer there which I'll assume from here on that you've read.

    If we contrast your function with the general MAD form, we see you're applying % tableCapacity to every character of the key. The reason that can make sense in Python is that Python doesn't have fixed-width integers that overflow the way many lower-level languages (and the CPU itself) do, so if you don't have some % operation inside the loop, the h value may grow to a similar size as the entire key - if you're generating a hash of a video file as a cheap checksum, that'd be very slow and wasteful of memory.

    So, using % to constrain how large h can get after every iteration is sane, but for the reasons explained in the other answer it's especially important that tableCapacity is prime, and hashBase should be chosen to typically produce values much larger than tableCapacity, to minimise the amount by which earlier hash buckets are more heavily utilised than later ones (see the 200/255 example in my other answer linked above).
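
    To illustrate that trade-off (my own sketch, not from the linked answer - the base and the prime 100003 here are arbitrary), compare reducing inside the loop against reducing only once at the end: both give the same bucket, but the unreduced intermediate grows without bound in Python:

    def hash_reduced(key, hashBase, tableCapacity):
        h = 0
        for ch in key:
            h = (h * hashBase + ord(ch)) % tableCapacity  # h stays below tableCapacity
        return h

    def hash_no_inner_mod(key, hashBase):
        h = 0
        for ch in key:
            h = h * hashBase + ord(ch)   # Python ints never overflow, so h keeps growing
        return h

    key = "x" * 10_000
    big = hash_no_inner_mod(key, 2654435761)
    print(big.bit_length())                                         # hundreds of thousands of bits
    print(big % 100_003 == hash_reduced(key, 2654435761, 100_003))  # True - same final bucket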

    In summary: pick a big pseudo-random hashBase - say a 32- or even 64-bit random number - and a prime tableCapacity in a sane ratio to your number of keys, given the open-addressing or separate-chaining design you've chosen.
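
    Putting that together, a sketch might look like this (the specific constants and names below are my own choices, not prescribed by anything above):

    import random

    HASH_BASE = random.getrandbits(64)   # big pseudo-random base, regenerated each run
    TABLE_CAPACITY = 100_003             # a prime, sized to suit your number of keys

    def my_hash(key, hashBase=HASH_BASE, tableCapacity=TABLE_CAPACITY):
        h = 0
        for ch in key:
            h = (h * hashBase + ord(ch)) % tableCapacity
        return h

    print(my_hash("example key"))   # a bucket index in range(TABLE_CAPACITY)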

    Why does picking hashBase = 1 increase the time complexity of the hash table's operations?

    hashBase shouldn't be small: if it is, the contribution of each key[i] is unlikely to wrap h around the table many times before the % operation is applied again, and you lose the scattering of the mapping around the table that this provides. In the extreme case of hashBase = 1, the hash is just the sum of the character codes mod tableCapacity, so anagrams and other keys with similar characters are guaranteed to collide.
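
    A quick demonstration of that degenerate case (my own example; the capacity 1009 and the larger base are arbitrary - the hash below is just the question's function written compactly with functools.reduce):

    from functools import reduce

    def my_hash(key, hashBase, tableCapacity):
        return reduce(lambda h, ch: (h * hashBase + ord(ch)) % tableCapacity, key, 0)

    for base in (1, 1_000_003):
        print([my_hash(w, base, 1009) for w in ("stop", "pots", "tops")])
    # base 1       -> three identical bucket indices (anagrams always collide)
    # base 1000003 -> three distinct bucket indices here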

    Why is it better to pick a large tableCapacity?

    Well, larger tables mean more buckets - with the same number of keys there will tend to be fewer collisions, but with decent hashing you don't need to go overboard. More buckets means more memory used, and fewer cache hits, which slows things down.

    Also, why does e.g. hashBase = 250726 and tableCapacity = 250727 cause its operations to slow down?

    As explained above, you want hashBase to be much larger than tableCapacity. Here hashBase is exactly tableCapacity - 1, i.e. hashBase ≡ -1 (mod tableCapacity), so the hash collapses to an alternating ± sum of the character codes. Those sums are small in magnitude, so every key lands in a narrow band of buckets near 0 or near tableCapacity, producing heavy collisions and slow operations.
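
    You can see that clustering directly (again my own demo, using the question's hash written compactly with functools.reduce):

    from functools import reduce

    def my_hash(key, hashBase, tableCapacity):
        return reduce(lambda h, ch: (h * hashBase + ord(ch)) % tableCapacity, key, 0)

    for word in ("stop", "pots", "tops", "hash", "table"):
        print(word, my_hash(word, 250726, 250727))
    # every result is within a few hundred of 0 or of 250727, so only a tiny
    # band of the 250727 buckets ever gets used - lots of collisions, slow lookups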