The official Java documentation says "Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75". I want to know how we arrive at the value of 0.5, and how to prove it mathematically.
Here's a discussion of where the 0.5 probably came from.
The context is the Java HashMap
implementation, specifically the point at which a bucket should be converted ("treeified") from a linked list to a red-black tree. This should occur extremely rarely under typical operation. What's a good value to choose that gives "extremely rarely"? An implementation comment in HashMap bases its justification on a Poisson distribution:
Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
factorial(k)).
It concludes that a bucket with 8 or more collisions occurs with a probability of less than one in ten million, so 8 was chosen as the "treeify threshold".
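As a quick check on that arithmetic, here is a minimal sketch that plugs a parameter of 0.5 into the quoted formula and also sums the tail for list sizes of 8 and above:

```java
// Minimal sketch (not JDK code): evaluate exp(-lambda) * pow(lambda, k) / factorial(k)
// and the tail probability for list sizes >= k.
public class PoissonCheck {
    static double poisson(double lambda, int k) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) {
            p *= lambda / i;                // builds up lambda^k / k! incrementally
        }
        return p;
    }

    static double poissonTail(double lambda, int k) {
        double tail = 0;
        for (int i = k; i < k + 50; i++) {  // terms far out in the tail are negligible
            tail += poisson(lambda, i);
        }
        return tail;
    }

    public static void main(String[] args) {
        System.out.println(poisson(0.5, 8));     // ~6e-8, matching the table in the JDK comment
        System.out.println(poissonTail(0.5, 8)); // also ~6e-8, well below one in ten million
    }
}
```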
Where did the parameter of 0.5 come from? In this case, the parameter means the average load (or fullness) of a HashMap, that is, the number of mappings divided by the "capacity" (table length, or number of buckets). The question can thus be restated as: What is the average load of a HashMap? This can't be derived mathematically, because it depends on the way a program uses a HashMap, the program's input, and so forth. But we can make some reasonable assumptions and arrive at some conclusions which probably apply to typical cases.
First, assume most HashMaps use a load factor of 0.75. Anecdotally I believe this to be true, as I've rarely seen cases where code uses an explicit load factor.
Second, assume that the number of mappings in a HashMap is uniformly distributed. I'm almost positive this is not true, but let's start with this as a working assumption and revisit it later.
Third, let's set aside cases where a HashMap contains over a billion or so elements, where things like Java's array size limitations come into play.
Fourth, for simplicity, let's consider only HashMaps that are created without a preallocated capacity and are populated with some unknown number of mappings.
Recall the definition of load factor: if a HashMap's load exceeds the load factor, the HashMap will be resized larger to keep the load below the load factor.
Clearly, a HashMap in normal operation can't have a load of more than 0.75. Nor, once it has grown beyond the default capacity, will it have a load of less than 0.375, because it won't be resized until its load reaches 0.75, and resizing halves the load. So maybe the average load is midway between 0.375 and 0.75, which is 0.5625.
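To make "resizing halves the load" concrete, consider a table of capacity 1024 with the default load factor:

```
768 / 1024 = 0.75       (load just before the resize triggered by the 769th mapping)
769 / 2048 ≈ 0.3755     (load just after the table doubles to 2048)
```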
One could do some math to figure this out, but it was easier for me to write a program to populate HashMaps of sizes between 1 and N and compute the average load over all those cases. If N is about 0.75 of a power of two (say, 768) then the average load is indeed quite close to 0.5625. However, this varies a lot depending on one's choice of N. If a chosen N is midway between 0.75 of successive powers of two (say, 576) then the average load is only 0.507. So the average load of HashMaps with up to N mappings varies between about 0.507 and 0.5625, depending on one's choice of N. (This is the "resizing granularity" variance mentioned in the comment.)
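Here is a minimal sketch along those lines, assuming the default initial capacity of 16 and modeling each resize as a doubling that happens when the number of mappings exceeds capacity × 0.75. It doesn't populate real HashMap instances, so its exact output may differ a little from the figures above, but it reproduces the dependence on the choice of N:

```java
// Sketch: model the capacity a default HashMap ends up with after n puts,
// then average the load n/capacity over all sizes 1..N.
public class AverageLoad {
    static int capacityFor(int n) {
        int cap = 16;                        // default initial capacity
        while (n > (int) (cap * 0.75)) {     // resize when size exceeds the threshold
            cap <<= 1;                       // each resize doubles the table
        }
        return cap;
    }

    static double averageLoad(int maxSize) {
        double sum = 0;
        for (int n = 1; n <= maxSize; n++) {
            sum += (double) n / capacityFor(n);  // load = mappings / capacity
        }
        return sum / maxSize;
    }

    public static void main(String[] args) {
        // 768 is 0.75 of a power of two; 576 is midway between two such points.
        System.out.println(averageLoad(768));
        System.out.println(averageLoad(576));
    }
}
```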
Recall that we assumed that HashMap sizes are uniformly distributed. It's probably not true, but what is the actual distribution? I don't know, but I'd guess that there is exponential falloff as the sizes get larger. If so, this would skew the distribution of sizes towards smaller numbers, reducing the average load to be closer to 0.507 than 0.5625. I'd also guess that a lot of people create HashMaps with the default capacity (16) and populate them with six or fewer mappings. That would pull the average down farther.
In summary, our model is telling us that the average load is somewhere between 0.507 and 0.5625. Some educated guesses tell us that sizes are skewed smaller, so the average load is also smaller. The choice of 0.5 therefore seems reasonable.
But we really don't need a precise answer for the analysis of when to treeify HashMap buckets. The point of the analysis is to find a threshold such that conversion is extremely unlikely to occur during normal operation. If 0.4 or 0.7 were used instead of 0.5, the Poisson analysis would still indicate that the chance of having 8 collisions in a single bucket is vanishingly small, on the order of one in a million or less.
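For instance, adding these lines to the Poisson sketch above illustrates the sensitivity (the figures in the comments are approximate):

```java
// Inside PoissonCheck.main, using the poissonTail helper sketched earlier:
System.out.println(poissonTail(0.4, 8)); // roughly 1e-8
System.out.println(poissonTail(0.5, 8)); // roughly 6e-8
System.out.println(poissonTail(0.7, 8)); // roughly 8e-7
```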