Why is HashMap rehashed?


HashMap in Java uses separate chaining to handle collisions, which is why you can insert any number of entries without any problem.

What I do not understand is why a HashMap (in Java) needs to be rehashed at all.

When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

So the HashMap gets expanded, but why?

Is that just an optimization of classical separate chaining to limit the number of keys that each bucket can have? (Because these keys have the same hash.)
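
For concreteness, here is how I read the quoted rule with the default settings (the class name and values below are just for illustration; the bucket count itself isn't visible through the public API):

```java
import java.util.HashMap;
import java.util.Map;

public class RehashThresholdDemo {
    public static void main(String[] args) {
        // Default capacity 16 and default load factor 0.75 give a resize
        // threshold of 16 * 0.75 = 12 entries.
        Map<Integer, String> map = new HashMap<>(16, 0.75f);

        for (int i = 1; i <= 13; i++) {
            // Putting the 13th entry exceeds the threshold, so the internal
            // bucket array is rebuilt with roughly twice as many buckets.
            map.put(i, "value" + i);
        }

        System.out.println("size = " + map.size()); // 13 entries; the resize has already happened
    }
}
```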


Solution

  • Is that just an optimization of classical separate chaining to limit the number of keys that each bucket can have (because these keys have the same hash)?

    That is more or less correct¹.

    A simplified version² of how a Java HashMap lookup works is as follows:

    1. key.hashCode() is called to get a 32-bit hash code.
    2. The 32-bit hash code is reduced to a smaller integer, which is used as an index into the array of hash buckets.
    3. The selected hash bucket is searched for an entry whose key equals the key that we are looking for.
    4. The value stored in the entry is returned. If there is no match, null is returned instead.
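
    A minimal sketch of those four steps, assuming a power-of-two bucket array and plain separate chaining with no treeification (this is an illustration of the idea, not the actual java.util.HashMap code):

    ```java
    // Illustrative only: a stripped-down map that performs steps 1-4 above.
    class SimpleChainedMap<K, V> {

        static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;   // next entry in the same bucket's chain
            Node(K key, V value, Node<K, V> next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        @SuppressWarnings("unchecked")
        Node<K, V>[] table = (Node<K, V>[]) new Node[16];   // B buckets (power of two)

        V get(K key) {
            int h = key.hashCode();                  // step 1: 32-bit hash code
            int index = h & (table.length - 1);      // step 2: reduce to a bucket index
            for (Node<K, V> n = table[index]; n != null; n = n.next) {
                if (key.equals(n.key)) {             // step 3: scan the bucket's chain
                    return n.value;                  // step 4: match found
                }
            }
            return null;                             // step 4: no match
        }
    }
    ```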

    Let us say that we have a total of N entries in the map, and there are currently B buckets.

    If we assume that the map entries are evenly distributed over the buckets, there will be roughly L(N, B) = N / B entries in each bucket. For example, with N = 1000 entries and B = 250 buckets, each bucket holds about 4 entries on average.

    Now steps 1, 2 and 4 take constant time with respect to N and B. However, the cost of step 3, searching a hash bucket, depends on the number of entries in that bucket.

    In a naive implementation³, the cost of that search is O(L). Specifically, you will need to check L / 2 entries on average if the key is in the map, and all L entries if it isn't.

    But as we stated previously, L is actually a function of the number of buckets B: if we increase B, we reduce L(N, B) = N / B.

    In a nutshell, rehashing increases the number of buckets B, which in turn reduces the average chain length L and therefore the average cost of searching a hash bucket, resulting in faster lookups on average.
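
    A matching sketch of the rehash itself, continuing the illustrative SimpleChainedMap above (again, a deliberate simplification, not the real java.util.HashMap resize code):

    ```java
    // Illustrative only: double the bucket array and redistribute every entry
    // by recomputing its bucket index against the larger table.
    @SuppressWarnings("unchecked")
    void resize() {
        Node<K, V>[] oldTable = table;
        Node<K, V>[] newTable = (Node<K, V>[]) new Node[oldTable.length * 2]; // B doubles
        for (Node<K, V> head : oldTable) {
            Node<K, V> n = head;
            while (n != null) {
                Node<K, V> next = n.next;
                int index = n.key.hashCode() & (newTable.length - 1); // new bucket index
                n.next = newTable[index];   // push onto the head of the new chain
                newTable[index] = n;
                n = next;
            }
        }
        table = newTable;   // N is unchanged and B has doubled, so L(N, B) = N / B halves
    }
    ```

    The rehash itself costs O(N), but because it roughly doubles B each time it runs, its amortized cost per insertion stays constant.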


    ¹ - You said: "because these keys have the same hash". That is not quite right. The keys in a bucket don't necessarily have the same hashCode; the collision happens in the reduction step, where different 32-bit hash codes can be reduced to the same bucket index.
    ² - For example, this ignores the fact that Java will switch to a red-black tree representation for any hash bucket whose chain gets too long. This only happens when the distribution of the keys across the buckets is (significantly) uneven. This is independent of rehashing, and hence we don't need to consider it here.
    ³ - That is, ignoring the red-black trees. In fact, these should not change the end result. Check it for yourself if you want to.