java, data-structures, time-complexity, hashtable, amortized-analysis

How is the Hashtable get(kType key) amortized time complexity O(1) and not O(log n)?


The index that a key is associated with is generally, in the simplest implementation of a hash table, retrieved in the following way:

size++;
int hash = hashcode(key);
int index = hash % size;

For an arbitrary key we can say that the index will be an integer in the range [0, size - 1] with equal probability for each outcome. These probabilities are described by the table below for the first 5 indices after adding N elements.

Index            |   0              1               2              3                4
--------------------------------------------------------------------------------------------
Probabilities    |  1
                 |  1/2             1/2
                 |  1/3             1/3             1/3
                 |  1/4             1/4             1/4             1/4
                 |  1/5             1/5             1/5             1/5             1/5    
                 |                                    ...
                 |  1/N             1/N             1/N             1/N             1/N
____________________________________________________________________________________________
Total            |  H(N)         H(N) - 1        H(N) - 1.5      H(N) - 1.83     H(N) - 2.08

H(N) is the expected number of elements that collect in the chain at index 0. Every chain after it should have statistically fewer elements.

H(N) is also the value of the harmonic series up to and including the Nth term. Although there is no closed form for this partial sum, it can be approximated very accurately using the following formula:

H(N) ≈ ln(N) + 0.5772156649 + 1 / (2N) - 1 / (12N^2)

Reference: https://math.stackexchange.com/questions/496116/is-there-a-partial-sum-formula-for-the-harmonic-series

The "approximation" part can be attributed to the terms after ln(N) + 0.5772156649. ln(N) is the largest function and thus the amortized time complexity should be O(log n).

Is there something I am missing? I would greatly appreciate clarification here.


Solution

  • Expanding my comment into an answer - let's start here:

    The index that a key is associated with is generally, in the simplest implementation of a hash table, retrieved in the following way:

    size++;
    int hash = hashcode(key);
    int index = hash % size;
    

    This actually isn't how most hash tables are implemented. Rather, most hash tables use a strategy like the following:

    1. Pick some fixed starting number of slots (say, 8, or 16, etc.)
    2. When an element is added, place that element in slot hashcode(key) % tablesize.
    3. Once the ratio of the number of items in the table to the number of slots exceeds a threshold called the load factor, perform a rehash: double the size of the table and redistribute the existing elements by recomputing hashcode(key) % tablesize with the new table size. (This last step ensures that items can still be found given that the table has been resized, and ensures that the items are distributed across the whole table, not just the first few slots.)
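
    To make that concrete, here is a minimal sketch of the strategy above using chained hashing (this is my own illustration, not the actual java.util.Hashtable source; the 16 starting slots, 0.75 load factor, and linked-list chains are arbitrary choices):

        import java.util.LinkedList;

        // Minimal chained hash table with a doubling rehash, mirroring steps 1-3 above.
        class SimpleHashTable<K, V> {
            private static final double LOAD_FACTOR = 0.75;

            private LinkedList<Entry<K, V>>[] buckets;
            private int size;

            @SuppressWarnings("unchecked")
            SimpleHashTable() {
                buckets = new LinkedList[16];          // step 1: fixed starting number of slots
            }

            private int indexFor(K key, int tableSize) {
                return Math.floorMod(key.hashCode(), tableSize);  // step 2: hashcode(key) % tablesize
            }

            void put(K key, V value) {
                if ((double) (size + 1) / buckets.length > LOAD_FACTOR) {
                    rehash();                          // step 3: grow before the table gets too full
                }
                int index = indexFor(key, buckets.length);
                if (buckets[index] == null) buckets[index] = new LinkedList<>();
                for (Entry<K, V> e : buckets[index]) {
                    if (e.key.equals(key)) { e.value = value; return; }
                }
                buckets[index].add(new Entry<>(key, value));
                size++;
            }

            V get(K key) {
                int index = indexFor(key, buckets.length);
                if (buckets[index] == null) return null;
                for (Entry<K, V> e : buckets[index]) {
                    if (e.key.equals(key)) return e.value;   // walks one (short) chain
                }
                return null;
            }

            @SuppressWarnings("unchecked")
            private void rehash() {
                LinkedList<Entry<K, V>>[] old = buckets;
                buckets = new LinkedList[old.length * 2];    // double the number of slots
                for (LinkedList<Entry<K, V>> chain : old) {
                    if (chain == null) continue;
                    for (Entry<K, V> e : chain) {
                        int index = indexFor(e.key, buckets.length);  // redistribute with the new size
                        if (buckets[index] == null) buckets[index] = new LinkedList<>();
                        buckets[index].add(e);
                    }
                }
            }

            private static class Entry<K, V> {
                final K key;
                V value;
                Entry(K key, V value) { this.key = key; this.value = value; }
            }
        }

    Note that the number of slots only changes at a rehash, not on every insertion. That is exactly what breaks the harmonic-series argument from the question: the modulus is not size++ each time an element is added.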

    The exact analysis of how fast this is will depend on how you implement the hash table. If you use chained hashing (each item is dropped into a slot and then stored in a separate array or linked list containing all the items in that slot) and your hash function is "more or less" uniformly random, then (intuitively) the items will probably be distributed more or less uniformly across the table slots.

    You can do a formal analysis of this by assuming you indeed have a random hash function, in which case the expected number of items in any one slot is at most the table's load factor (the ratio of the number of items to the number of slots). The load factor is typically chosen to be a constant, which means that the expected number of items per slot is upper-bounded by that constant, hence the claim about O(1) expected lookup times. (You can also get similar O(1) bounds on the expected costs of lookups for linear probing hash tables, but the math is much more involved.)
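
    To make the "expected number of items per slot" claim concrete, here is a rough, self-contained simulation (my own illustration, with made-up parameters: 2^20 slots and a load factor of 0.75, not taken from any real Hashtable implementation). It throws uniformly random keys into a fixed number of slots and reports the average and longest chain:

        import java.util.Random;

        // Throw N random keys into M slots and inspect the resulting chain lengths.
        // With N/M = 0.75, the average chain a lookup walks is a small constant
        // that does not grow with N.
        public class ChainLengths {
            public static void main(String[] args) {
                Random rng = new Random(42);
                int slots = 1 << 20;                  // M = 1,048,576 slots
                int items = (int) (slots * 0.75);     // load factor 0.75
                int[] counts = new int[slots];
                for (int i = 0; i < items; i++) {
                    counts[Math.floorMod(rng.nextInt(), slots)]++;
                }
                int longest = 0;
                for (int c : counts) longest = Math.max(longest, c);
                System.out.println("average chain length = " + ((double) items / slots));
                System.out.println("longest chain length = " + longest);
            }
        }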

    The "amortized" part comes in because of the rehashing step. The idea is that, most of the time, insertions don't push the load factor above the threshold needed for a rehash, so they're pretty fast. But every now and then you do have to rebuild the table. Assuming you double the table's size, you can show that each rehash is proceeded by a linear number of insertions that didn't trigger a rehash, so you can backcharge the work required to do the rehash to the previous operations.