Tags: java, hash, java-8, hashmap, hash-collision

HashMap hash function - binary operators


I was going through the source code of HashMap, but the binary operators confuse me a lot.

I do understand the general purpose of the code below: fair distribution, and bringing the hashCode within the bucket limit.

Can someone explain the comments here, and what is the benefit of doing it the way it is currently done?

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

It would be a really big help if someone could explain this to me.

This is not a duplicate because the other questions are related to hash implementations before Java 8.

Thanks in advance


Solution

  • hashCode() returns an int, which is 32 bits wide.

    Internally, HashMap keeps the objects in pow(2, n) buckets or bins. The value of n can vary — the details are not important here; what is important is that n is typically much smaller than 32 (the number of bits in the hash).

    Each object is placed into one of the buckets. To achieve good performance, it is desirable to spread the objects evenly across the buckets. This is where object hashes come in: the simplest way to choose a bucket is to take the lowest n bits of the object's hash code (a bitwise AND with pow(2, n) - 1). However, this uses only the lowest n bits and ignores the rest of the hash; the first sketch after this list shows how badly that can go.

    In the comments, the authors argue that this is undesirable. They cite known use cases in which object hashes systematically differ in bits other than the lowest n. This would lead to systematic collisions, and systematic collisions are bad news.

    To partially address this, they've implemented the current heuristic (its effect is shown in the second sketch below), which:

    • keeps the top 16 bits of the hash as they are;
    • replaces the bottom 16 bits with an XOR of the top 16 and the bottom 16 bits.
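
To make the Float example from the JDK comment concrete, here is a minimal sketch (the class and variable names are mine, not JDK code). Float.hashCode() returns the raw IEEE 754 bit pattern of the value, and for small whole numbers all of the variation sits in the exponent and the top of the mantissa, so the naive "lowest n bits" bucket choice collapses every key into the same bucket:

    public class NaiveBucketDemo {
        public static void main(String[] args) {
            // A small table with 2^4 = 16 buckets, as HashMap starts out.
            int mask = (1 << 4) - 1;

            for (float f = 1.0f; f <= 6.0f; f += 1.0f) {
                // Float.hashCode() is the raw IEEE 754 bit pattern of the value.
                int h = Float.valueOf(f).hashCode();
                System.out.printf("%.1f: hash = %08x, naive bucket = %d%n",
                        f, h, h & mask);
            }
        }
    }

Every key lands in bucket 0: the low 16 bits of each of these hash codes are zero, so all of the information is in the upper bits that the mask throws away.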
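
And here is a second sketch of the fix (again a hypothetical demo class; spread() copies the XOR transform from HashMap.hash()). One caveat worth hedging: for these particular keys the variation sits high enough that the table needs to be somewhat larger than 16 buckets before the folded-down bits reach the mask; with 256 buckets the spread keys land in five different buckets instead of all piling up in bucket 0:

    public class SpreadDemo {
        // The same transform as java.util.HashMap.hash(): XOR the top
        // 16 bits of the hash into the bottom 16 bits.
        static int spread(int h) {
            return h ^ (h >>> 16);
        }

        public static void main(String[] args) {
            // 2^8 = 256 buckets; large enough here for the folded-down
            // bits to influence the bucket index.
            int mask = (1 << 8) - 1;

            for (float f = 1.0f; f <= 6.0f; f += 1.0f) {
                int h = Float.valueOf(f).hashCode();
                System.out.printf("%.1f: naive bucket = %3d, spread bucket = %3d%n",
                        f, h & mask, spread(h) & mask);
            }
        }
    }

The spreading is deliberately cheap and imperfect: 1.0f and 4.0f still share a bucket here. As the JDK comment says, it is only meant to reduce systematic lossage, and tree bins catch whatever still collides.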