In hashing, we take the input, apply some complex hashing algorithm, and then take the result mod n to find the bucket or server to which the input should be sent: hash input x -> Hash(x) -> Hash(x) mod n gives the location of the bucket.
If we take the input directly without hashing, that is equivalent to using an identity hash function: Hash(x) = x, then mod n. Wikipedia calls this a 'trivial' hash function.
Generally, Hash(x) is a complex hashing algorithm such as MD5, SHA, etc. Q1) Regardless of how we hash it, the result boils down to a value between 0 and n-1 (the remainder when divided by n). So how does the choice of hashing function matter? Q2) I know that an ideal hash function distributes the input values uniformly across the buckets. In this respect, are those complex hashing functions superior to the identity hash function?
Assume that the input is always an integer.
What is the advantage of applying a complex hash function and then taking mod n instead of simply doing mod n for the input?
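To make the two schemes in the question concrete, here is a minimal sketch (Python is an assumption, as is the bucket count n = 8; hashlib's MD5 stands in for "some complex hashing algorithm"):

```python
import hashlib

n = 8  # hypothetical number of buckets

def identity_bucket(x: int) -> int:
    # The 'trivial' identity hash: Hash(x) = x, then mod n.
    return x % n

def md5_bucket(x: int) -> int:
    # Complex hash first: MD5 the integer's bytes, read the
    # digest back as an integer, then mod n.
    digest = hashlib.md5(x.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % n

print(identity_bucket(42))  # 42 % 8 == 2
print(md5_bucket(42))       # some bucket in 0..7
```

Both functions end with the same mod n; the question is whether the hashing step in front of it buys anything.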
Let's look at a simple example. Say our keys are 100 pointers to objects in memory that are 8-byte aligned: that means the 3 least-significant bits are always 0. Our table size is currently 128 buckets. If we mod the pointer values by 128 before hashing, we're effectively taking:
    32-bit pointer bits    xxxxxxxx xxxxxxxx xxxxxxxx xxxxx000
    mod 128                00000000 00000000 00000000 0xxxx000
Notice that only 4 potentially meaningful bits from the pointer make it through to our hash function, so at most 16 distinct values reach it: our 100 pointers will collide into at most 16 buckets, which means collision chains will typically be 6 or 7 deep even with the strongest hash function. That's woeful given we had 128 buckets for 100 keys: we should have had mostly 0, 1 or 2 keys mapped to any given bucket.
Now, what would have happened if we'd had 100 pointers to memory mapped areas, each 4096-byte page aligned? They all would have mapped to the same bucket.
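A quick sketch of that worst case (assumed Python, with a made-up base address; the hash step is skipped, since any deterministic hash of identical residues is itself identical):

```python
# 100 simulated pointer values, each 4096-byte page aligned.
pointers = [0x10000000 + i * 4096 for i in range(100)]

# Mod by the table size (128) before any hashing: 4096 is a
# multiple of 128, so every pointer leaves the same residue.
buckets = {p % 128 for p in pointers}
print(len(buckets))  # 1 — all 100 keys in a single bucket
```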
Not doing the mod operation until the end ensures the higher-order bits in the keys can help randomise the lower-order bit positions of the hash value, and it is those lower-significance bits that determine which bucket the key maps to. (Another thing that can help a little is making the table size a prime number, but that's best used in combination with doing the mod after hashing. As a random sampling, GNU's C++ compiler uses prime bucket counts for Standard Library hash tables, while Visual C++ uses powers of two (and, for long strings, faster but weaker hash functions).)
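To see the high-order bits rescuing the low-order ones, here is a sketch (assumed Python) that buckets the same page-aligned pointers two ways; the mixer's constants are borrowed from splitmix64, which is my choice of illustration, not something the answer prescribes:

```python
def mix64(x: int) -> int:
    # Simple 64-bit bit mixer (splitmix64 finalizer): shifts and
    # multiplies propagate high-order bits into low-order positions.
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9 & 0xFFFFFFFFFFFFFFFF
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb & 0xFFFFFFFFFFFFFFFF
    return x ^ (x >> 31)

pointers = [0x10000000 + i * 4096 for i in range(100)]

naive = {p % 128 for p in pointers}          # mod first: one bucket
mixed = {mix64(p) % 128 for p in pointers}   # hash first: spread out

print(len(naive))  # 1
print(len(mixed))  # many distinct buckets
```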
Q1) Regardless of how we hash it, it just boils down to a value between 0 and n-1 (the remainder when divided by n). So, how does the choice of hashing function matter?
Obviously if our hash function was h(key) { return 0 }
every key would collide at bucket 0. At the other extreme, a cryptographic hash function should effectively randomly but repeatably map any given key to a given bucket, such that a bit changing anywhere in the key produces a completely uncorrelated mapping. That helps protect you from excessive collisions when keys don't vary at many bit positions. But strong hash functions tend to take longer to calculate, and the reduction in collisions may or may not result in a net performance win. It's sometimes worth choosing the strength of the hash function based on knowledge of how much the keys are likely to differ from each other.
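The degenerate case above can be sketched directly (assumed Python; the 128-bucket table and 1000 keys are illustrative choices):

```python
def h(key):
    # The degenerate hash from the answer: ignores the key entirely.
    return 0

n = 128
buckets = {h(k) % n for k in range(1000)}
print(buckets)  # {0} — one giant collision chain, 127 empty buckets
```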
Q2) I know that an ideal hash function distributes the input values uniformly across the buckets. In this respect, are those complex hashing functions superior to the identity hash function?
At one extreme, identity hash functions hope the input numbers will map onto distinct buckets with higher probability than a cryptographic-strength hash function would achieve: for example, if we hash 5, 6, 7, 8, 10 into a table using the identity function, they're dense (close to each other) and span just 6 values (5 through 10), so as long as the table size is >= 6 (e.g. the prime 7) they're guaranteed not to collide. But identity hash functions given collision-prone inputs (e.g. pointers cast to numbers) are a disaster, as they've done nothing to mix the more-significant bits in with the less-significant bits before the mod kicks in - the same problem explained for pointers above.
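The dense-key example can be checked in a few lines (assumed Python):

```python
# Identity hash into a table of the prime size 7, as in the example.
keys = [5, 6, 7, 8, 10]
buckets = [k % 7 for k in keys]
print(buckets)               # [5, 6, 0, 1, 3]
print(len(set(buckets)))     # 5 — no two keys share a bucket
```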
In summary: identity hash functions can have better average-case performance for common integer keys, but far worse worst-case performance for non-dense, non-random / collision-prone keys.