Tags: data-structures, big-o, hashtable

Why is the Big-O complexity for Small Load Factor HashTables O(1)?


I'm having trouble understanding how the load factor is used in the big-O analysis of Chained and Open-Addressed HashTables.

From my understanding:

LoadFactor = (# of entries in HashTable)/(# of "slots" in HashTable) = (n/m)

Thus, the LoadFactor reflects how heavily the HashTable is utilized by the data entered into it.
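For a concrete feel for the numbers, here is a tiny sketch (the class name and the values of n and m are made up for illustration, not taken from the question) that computes the load factor of a table with m slots holding n entries:

```java
// Illustrative only: computes LoadFactor = n / m for made-up values of n and m.
public class LoadFactorDemo {
    public static void main(String[] args) {
        int m = 16;                      // number of slots ("buckets") in the table
        int n = 11;                      // number of entries currently stored
        double alpha = (double) n / m;   // LoadFactor = n / m
        System.out.printf("load factor = %d/%d = %.2f%n", n, m, alpha);
        // For reference, java.util.HashMap grows its slot array once n exceeds
        // capacity * 0.75 (its default load factor), which keeps alpha small.
    }
}
```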

For a Chained HashTable, the Worst-Case Time Complexity is O(n) because, with a non-uniform distribution in which every element hashes to a single slot of the HashTable, the problem reduces to a search through a Linked List of size n.
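A minimal sketch of that degenerate case (my own illustration, not code from the question): a chained table whose hash function sends every key to slot 0 ends up with a single chain holding all n entries, so a lookup walks a list of length n.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedList;

// Illustrative chained hash table with a deliberately terrible hash function.
class DegenerateChainedTable<K, V> {
    private final LinkedList<SimpleEntry<K, V>>[] slots;

    @SuppressWarnings("unchecked")
    DegenerateChainedTable(int m) {
        slots = new LinkedList[m];
        for (int i = 0; i < m; i++) slots[i] = new LinkedList<>();
    }

    // Worst case: every key lands in slot 0, so one chain grows to length n.
    private int slotFor(K key) {
        return 0;
    }

    void put(K key, V value) {
        slots[slotFor(key)].add(new SimpleEntry<>(key, value));
    }

    V get(K key) {
        // Scans the single chain containing all n entries: O(n).
        for (SimpleEntry<K, V> e : slots[slotFor(key)]) {
            if (e.getKey().equals(key)) return e.getValue();
        }
        return null;
    }
}
```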

For an Open-Addressed HashTable, the Worst-Case Time Complexity is also O(n) because, once again, a non-uniform distribution in which every element hashes to the same slot forces the elements to be placed in consecutive slots by probing. Thus, the problem reduces to a linear search through an array of size n.
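A similar sketch for linear probing (again my own illustration; it assumes n < m, since an open-addressed table cannot hold more entries than slots): when every key's home slot is the same, entries fill consecutive slots and a lookup probes up to n of them.

```java
// Illustrative open-addressed (linear probing) table with a degenerate hash.
class DegenerateProbingTable {
    private final String[] keys;
    private final Integer[] values;

    DegenerateProbingTable(int m) {
        keys = new String[m];
        values = new Integer[m];
    }

    // Worst case: every key starts probing at slot 0.
    private int home(String key) {
        return 0;
    }

    void put(String key, Integer value) {       // assumes n < m, so a free slot exists
        int i = home(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;           // linear probe to the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    Integer get(String key) {
        int i = home(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;           // worst case walks n consecutive slots
        }
        return null;
    }
}
```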

For the worst-case scenarios, I assumed n>m.

Now, for a small load factor, both the Chained and the Open-Addressed HashTable yield O(1) lookups.
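For intuition, the standard textbook expected-cost results (under the simple uniform hashing assumption; the symbols below are mine, not from the question) tie that O(1) claim directly to the load factor α = n/m:

```latex
% Expected cost of an unsuccessful search, with load factor \alpha = n/m < 1:
%   chaining: traverse one chain of expected length \alpha
%   open addressing (uniform probing): at most 1/(1-\alpha) probes on average
\[
  T_{\text{chain}} = \Theta(1 + \alpha),
  \qquad
  T_{\text{open}} \le \frac{1}{1 - \alpha}
\]
% Both are O(1) whenever \alpha is bounded by a constant below 1 (e.g. \alpha \le 0.75).
```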

I fail to see the distinction between the n > m case and the n < m case.

Why is this the case?


Solution

  • When n is much smaller than m (the number of buckets), then it's likely that every key has hashed to a unique bucket. As n increases, the likelihood of a collision (two keys hashing to the same bucket) increases. When n > m, then by the Pigeonhole principle, it's guaranteed that at least two keys will hash to the same value.

    So when n is much smaller than m, the likelihood of collision is smaller, so the expected lookup time is O(1). As the number of items grows, you spend more time resolving collisions.

    Empirical evidence suggests that you don't want your load factor to exceed 0.75. Probably closer to 0.70. When n > 0.70*m, the hash table becomes hugely inefficient. The actual number varies depending on your data, of course, but that's a good rule of thumb.

    The Birthday problem shows how the collision rate increases as n approaches m. The likelihood of encountering at least one collision reaches about 50% once you have inserted roughly the square root of the number of slots in the table. If you were to create a hash table of size 365 and start inserting items, your chance of seeing a hash collision is higher than 50% once you have inserted just 23 items; a quick calculation is sketched below. (That's assuming a good hashing function. A poor hashing function will increase your likelihood of collision overall.)
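
    One way to check that figure (my own sketch, not part of the original answer) is to compute the no-collision probability directly for m = 365 slots and stop at the first n where a collision becomes more likely than not:

    ```java
    // Finds the smallest n for which P(at least one collision) exceeds 50%
    // when n keys hash uniformly into m = 365 slots (the birthday problem).
    public class BirthdayCollision {
        public static void main(String[] args) {
            int m = 365;
            double pNoCollision = 1.0;
            for (int n = 1; n <= m; n++) {
                // The n-th key must avoid the n-1 slots already occupied.
                pNoCollision *= (double) (m - (n - 1)) / m;
                double pCollision = 1.0 - pNoCollision;
                if (pCollision > 0.5) {
                    System.out.printf("n = %d: P(collision) ~ %.3f%n", n, pCollision);
                    break;                       // prints n = 23, P ~ 0.507
                }
            }
        }
    }
    ```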