
Why is the worst-case time complexity of looking up an item in a Python set or dict O(n)?


I am trying to understand why the worst-case time complexity of looking up an item in a set (`item in s`), as well as getting a key from a dictionary (`d.get(key)`), is O(n), while the average case is O(1), given that the underlying data structure of both set and dict is a hash table.

Existing posts on Stack Overflow related to this topic either don't tackle this question or are too advanced for a beginner like me. An answer on one post said that the time complexity of `item in s` is O(n) when "your hash table's load factor is too high, then you face collisions". I thought that a set didn't allow duplicate values to be added, and that it would also ignore any duplicates present at creation. How does a collision happen in a set, and how exactly does it make the time complexity go from O(1) to O(n)? How about in a dictionary?

What is the conceptual explanation? What would some simple code examples be?


To put it simply, for people who were confused about collisions versus duplicates as I was: values don't need to be duplicates to have a hash collision.

For example, "apple" and "banana" are not duplicates, so they can both be added to the same set, but depending on the hash function, the hash values computed for "apple" and for "banana" may be the same, which is a hash collision.
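A concrete, CPython-specific illustration of two distinct values colliding (string hashes are randomized per run, so integers make a cleaner demo): CPython reserves the hash value -1 as an internal error code and remaps it to -2, so the distinct integers -1 and -2 genuinely share a hash.

```python
# In CPython, hash(-1) is remapped to -2 internally, so the two
# distinct integers -1 and -2 collide in any hash table.
print(hash(-1))       # -2
print(hash(-2))       # -2
print(len({-1, -2}))  # 2 -- both are kept: distinct values, equal hashes
```

So a collision is about hash values being equal, not about the values themselves being duplicates.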

The answers and comments below explain why a hash collision leads to longer lookup times, and to O(n) in the worst case, when every value in a set collides.


Solution

  • Collisions are a consequence of the hash functions Python uses for the values in a set or the keys of a dictionary. A hash function maps an item to a hash value, and ideally each item would land in its own bucket of the hash table (used internally by sets and dictionaries). However, because there are only finitely many buckets, different items can end up mapping to the same bucket, and that is a collision.

    For example, let's say that your set/dict has two items, "apple" and "banana", and their hash values turn out to land in the same bucket. In that case, the hash table needs to detect and resolve the collision.

    CPython's implementation of sets and dictionaries resolves collisions with open addressing rather than separate chaining: there are no linked lists of colliding items. When an item's bucket is already occupied, the table probes a deterministic sequence of other buckets until it finds a free slot, and the new item is stored there.

    Now, when you perform a lookup on a set or dictionary, Python uses the hash value of the item you're searching for to locate its bucket, then compares the item stored there against your item with `==`. If they are not equal (a collision), it follows the same probe sequence, comparing against each occupied bucket it visits, until it either finds a match or hits an empty slot (meaning the item is absent).

    Therefore, in the worst-case scenario, every item in your set/dict has the same hash value, all the probe sequences pile onto the same buckets, and a single lookup may have to compare against every stored item, which is O(n). Again, this is the worst case and it is very rare in practice: with Python's built-in hash functions and automatic resizing of the table, lookups are O(1) on average.
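    You can force this worst case deliberately with a class whose `__hash__` always returns the same value (the `Collider` class below is a made-up name, used purely for illustration):

    ```python
    class Collider:
        """Every instance hashes to 0, so all of them collide."""
        def __init__(self, n):
            self.n = n
        def __hash__(self):
            return 0  # degenerate hash: everything lands in the same bucket
        def __eq__(self, other):
            return isinstance(other, Collider) and self.n == other.n

    s = {Collider(i) for i in range(500)}
    print(len(s))               # 500 -- distinct values are all kept

    # Each lookup must now probe past the other colliding entries with ==,
    # so membership tests degrade from O(1) to O(n).
    print(Collider(499) in s)   # True
    print(Collider(-1) in s)    # False, reached only after scanning entries
    ```

    Note that even building this set is quadratic overall, since every insertion has to probe past the previously inserted items. Normal Python objects never behave like this; you only get it by sabotaging `__hash__` yourself or by hashing adversarially chosen inputs.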