HashMap in Java uses separate chaining to handle collisions, which is why you can keep inserting entries without any problem.
What I do not understand is why a HashMap (in Java) needs to be rehashed. The documentation says:
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
So the HashMap gets expanded, but why?
Is that just an optimization of the classical separate chaining to limit the number of keys that each bucket can have? (Because these keys have the same hash.)
That is more or less correct¹.
A simplified version² of how a Java HashMap lookup works is as follows:

1. key.hashCode() is called to get a 32 bit hash code.
2. The hash code is reduced to an index that selects one of the hash buckets.
3. The chain of entries in that bucket is searched for an entry whose key is equal to the given key.
4. If a matching entry is found, its value is returned; otherwise null is returned instead.

Let us say that we have a total of N entries in the map, and there are currently B buckets. If we assume that the map entries for existing keys are equally distributed over the buckets, that means that there are roughly L(N, B) = N / B entries in each bucket.
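
To make these steps concrete, here is a minimal sketch of a separate-chaining lookup. The class SimpleChainedMap and its Node type are invented for illustration and use a plain modulo reduction; the real java.util.HashMap internals differ (hash mixing, bit masking, treeification), but the four steps are the same.

    import java.util.Objects;

    class SimpleChainedMap<K, V> {
        // One node in a bucket's chain.
        static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
        }

        @SuppressWarnings("unchecked")
        private Node<K, V>[] buckets = (Node<K, V>[]) new Node[16];      // B buckets

        V get(K key) {
            int h = key.hashCode();                                      // step 1: 32 bit hash code
            int index = (h & 0x7fffffff) % buckets.length;               // step 2: reduce to a bucket index
            for (Node<K, V> n = buckets[index]; n != null; n = n.next) { // step 3: search the chain
                if (Objects.equals(key, n.key)) {
                    return n.value;                                      // step 4: found, return the value
                }
            }
            return null;                                                 // step 4: not found, return null
        }
    }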
Now steps 1, 2 and 4 are constant with respect to N and B. However, the cost of step 3, searching a hash bucket, depends on the number of elements in the bucket.
In a naive implementation³ the cost of the search will be O(L). Specifically, you will need to check on average L / 2 entries if the key is in the map and L entries if it isn't.
But as we stated previously, L is actually a function of the number of buckets B. Indeed, if we increase the number of buckets B, we reduce L(N, B).
In a nutshell, rehashing increases the number of buckets B, which in turn reduces the average cost of searching the hash buckets, resulting in faster lookups on average.
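
As an illustration, here is a minimal sketch of what rehashing could look like in the SimpleChainedMap sketch above (the actual java.util.HashMap resize logic is more refined, but the effect on L(N, B) is the same): allocate twice as many buckets and move every entry onto its new chain.

    @SuppressWarnings("unchecked")
    void rehash() {
        Node<K, V>[] old = buckets;
        buckets = (Node<K, V>[]) new Node[old.length * 2];                    // B -> 2B buckets
        for (Node<K, V> head : old) {
            // Move each node of the old chain onto its new chain.
            for (Node<K, V> n = head; n != null; ) {
                Node<K, V> next = n.next;
                int index = (n.key.hashCode() & 0x7fffffff) % buckets.length; // reduce with the new B
                n.next = buckets[index];
                buckets[index] = n;
                n = next;
            }
        }
    }

With N entries, the expected chain length drops from N / B to N / (2B) after such a rehash, which is exactly the reduction in lookup cost described above.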
1 - You said: "because these keys have the same hash". That is incorrect. The keys don't necessarily have the same hashCode. The hash collision occurs after the reduction step (see the small demo after these notes).
2 - For example, this ignores the fact that Java will switch to a red-black tree representation for any hash bucket whose chain gets too long. This only happens when the distribution of the keys across the buckets is (significantly) uneven. This is independent of rehashing, and hence we don't need to consider it here.
3 - That is, ignoring the red-black trees. In fact, these should not change the end result. Check it for yourself if you want to.
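
To illustrate the point from note 1, here is a small demo (it assumes the plain modulo reduction from the sketches above; java.util.HashMap mixes the hash and masks it instead, but the effect is the same): two keys with different hashCode values end up in the same bucket once the hash code is reduced to an index.

    public class ReductionCollisionDemo {
        public static void main(String[] args) {
            int buckets = 16;
            Integer k1 = 1;   // hashCode() == 1
            Integer k2 = 17;  // hashCode() == 17, a different hash code
            System.out.println(k1.hashCode() % buckets);   // 1
            System.out.println(k2.hashCode() % buckets);   // 1 -> same bucket, so the keys collide after reduction
        }
    }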