Tags: java, algorithm, hashmap, hashtable, clrs

Why is Hashtable's load factor not consistent with the one described in the CLRS book?


The Java documentation for the Hashtable class says:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs

So the load factor for Hashtable is 0.75, which means that if there are N keys, Hashtable will use roughly M = N/0.75 slots to store them.

The CLRS book also introduces the load factor, alpha.

But from my understanding, CLRS intends alpha to be set larger than 1, i.e., M = N/alpha < N. This means a hash table can use M slots, where M < N, so that it saves the storage that unused keys would otherwise take up.

I say M < N can save storage because normally we don't know the exact value of N; we only know the set of possible keys, and N stands for the number of possible keys. That set may be very big, but the number of keys actually in use may be very small, so setting M smaller than N saves storage. This is also why a hash table normally does not use a direct array that maps every {key, value} pair 1:1.
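
To make the storage argument concrete, here is a minimal sketch (the class name, key universe size, and values are illustrative, not from the original post): a direct-address table must reserve one slot per possible key, while a hash table's storage grows with the number of keys actually stored.

    import java.util.Hashtable;

    public class DirectVsHashed {
        public static void main(String[] args) {
            int universeSize = 1_000_000; // |U|: every key value that could occur

            // Direct addressing: one slot per possible key, even if only one is used.
            String[] direct = new String[universeSize];
            direct[42] = "value for key 42";

            // Hashing: storage grows with the number of stored keys (and the load
            // factor), not with the size of the key universe.
            Hashtable<Integer, String> table = new Hashtable<>();
            table.put(42, "value for key 42");

            System.out.println("direct-address slots: " + direct.length); // 1000000
            System.out.println("hash table entries:   " + table.size());  // 1
        }
    }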

But Hashtable in Java uses more storage than N. I think this is not consistent with the CLRS design, right?

Am I right?

thanks


Solution

  • Well, the allocated capacity should be larger than the number of elements added. Dividing by a number less than one results in a larger number than the initial one.

    Assuming you want to add 100 elements, you can write either:

    int allocationSize = (int) (100 / 0.75); // your formula: M = N / 0.75
    

    or

    int allocationSize = (int) (100 * 1.33333333); // M = N / X -> M = N * (1/X)
    

    Both expressions evaluate to 133.333333..., which the cast truncates to 133.
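
    As a quick sanity check (a standalone sketch; the class name is made up), both forms give the same integer capacity once truncated:

    public class CapacityMath {
        public static void main(String[] args) {
            int n = 100;              // entries to store
            double loadFactor = 0.75;

            double viaDivision = n / loadFactor;             // M = N / 0.75
            double viaMultiplication = n * (1 / loadFactor); // M = N * (1/X)

            System.out.println((int) viaDivision);       // 133
            System.out.println((int) viaMultiplication); // 133
        }
    }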

    The whole JavaDoc:

    An instance of Hashtable has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. Note that the hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hashtable exceeds the product of the load factor and the current capacity, the capacity is increased by calling the rehash method.

    Generally, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the time cost to look up an entry (which is reflected in most Hashtable operations, including get and put).

    The initial capacity controls a tradeoff between wasted space and the need for rehash operations, which are time-consuming. No rehash operations will ever occur if the initial capacity is greater than the maximum number of entries the Hashtable will contain divided by its load factor. However, setting the initial capacity too high can waste space.

    If many entries are to be made into a Hashtable, creating it with a sufficiently large capacity may allow the entries to be inserted more efficiently than letting it perform automatic rehashing as needed to grow the table.
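
    Following the rule quoted above, here is a minimal, hedged sketch of pre-sizing a Hashtable so that inserting a known number of entries never triggers rehash() (the class name, entry count, and key format are illustrative):

    import java.util.Hashtable;

    public class PreSizedHashtable {
        public static void main(String[] args) {
            int expectedEntries = 100; // N, known in advance
            float loadFactor = 0.75f;  // Hashtable's default

            // Per the JavaDoc above: no rehash occurs if the initial capacity is
            // greater than the maximum number of entries divided by the load factor.
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor); // 134

            Hashtable<String, Integer> table = new Hashtable<>(initialCapacity, loadFactor);
            for (int i = 0; i < expectedEntries; i++) {
                table.put("key-" + i, i); // per the rule above, never forces a rehash
            }
            System.out.println(table.size()); // 100
        }
    }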