
How are values stored in unordered_map


In a std::unordered_map, the key is hashed so that the value can be found in O(1).

But what if two different keys, say two string keys, produce hashes that are equal to each other?

Can this cause a problem?


Solution

  • This is known as a collision. It is also why we talk about "buckets" in the context of hash maps: it is perfectly normal for multiple keys with the same hash to end up in the same "bucket".

    This means that lookup is always a two-stage process:

    1. Find the right bucket.
    2. Find the key in the bucket.

    Note also that even keys with different hashes can end up in the same bucket: hashes are typically drawn from a much larger range than the number of buckets, so the table reduces the hash further, e.g. by taking it modulo the bucket count.
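    You can watch this happen with std::unordered_map's standard bucket interface. This is a minimal sketch; the exact hash-to-bucket reduction is implementation-defined, but hash % bucket_count is the typical scheme:

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m;
        m["apple"]  = 1;
        m["banana"] = 2;
        m["cherry"] = 3;

        std::hash<std::string> hasher;
        for (const auto& kv : m) {
            const std::string& key = kv.first;
            // The table reduces the full hash to a bucket index,
            // conceptually hash % bucket_count.
            std::cout << key << ": hash = " << hasher(key)
                      << ", bucket = " << m.bucket(key)
                      << " of " << m.bucket_count() << '\n';
        }

        // A bucket holding more than one key is a collision.
        for (std::size_t b = 0; b < m.bucket_count(); ++b)
            if (m.bucket_size(b) > 1)
                std::cout << "bucket " << b << " holds "
                          << m.bucket_size(b) << " keys\n";
    }
    ```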

    There are multiple strategies for dealing with collisions; the two most popular are:

    • Separate Chaining: The bucket is itself another collection, such as a list or a tree (usually a list). The worst-case lookup complexity of the overall hash table is then the worst-case lookup complexity of the bucket data structure, e.g. O(n) for a list, the worst case being that all keys hash to the same bucket. (A list is usually chosen because it is very simple; other data structures may speed up lookup a little, but usually have much more complex insert operations, which would slow insertion down too much.)
    • Open Addressing: Instead of putting multiple entries in the same bucket, search for another bucket that is empty. There are several probing strategies (see the sketch after this list):
      • Linear Probing: Try the next bucket, and then the next, and then the next, and so on.
      • Quadratic Probing: Try the bucket 1 slot away, then 4, 9, 16 slots away, and so on; the offsets are the successive squares.
      • Double Hashing: Use a second hash function and try the bucket 1 * hash2(key) slots away, then 2 * hash2(key), 3 * hash2(key), and so on.
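    std::unordered_map itself effectively uses separate chaining (the bucket interface shown above reflects that). For the open-addressing alternative, here is a minimal sketch of a table with linear probing; it is an illustration only, with no deletion, no resizing, and the assumption that the table never fills completely:

    ```cpp
    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <vector>

    class ProbingTable {
        struct Slot { std::string key; int value = 0; bool used = false; };
        std::vector<Slot> slots_ = std::vector<Slot>(16);

        // Start at hash % size; on a collision, step to the next slot
        // (wrapping around) until we hit the key or an empty slot.
        std::size_t probe(const std::string& key) const {
            std::size_t i = std::hash<std::string>{}(key) % slots_.size();
            while (slots_[i].used && slots_[i].key != key)
                i = (i + 1) % slots_.size();
            return i;
        }

    public:
        void insert(const std::string& key, int value) {
            Slot& s = slots_[probe(key)];
            s.key = key;
            s.value = value;
            s.used = true;
        }

        std::optional<int> find(const std::string& key) const {
            const Slot& s = slots_[probe(key)];
            if (s.used) return s.value;  // probe() stops at a match or an empty slot
            return std::nullopt;
        }
    };
    ```

    Quadratic probing or double hashing would only change the step inside probe(): instead of advancing by 1, advance by successive squares or by a second hash of the key.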

    In both cases, you will need to periodically resize the hash table: with open addressing, simply because at some point the table would be full; and in both cases, to keep the buckets small and the search fast.
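    In std::unordered_map this resizing is automatic: the table rehashes whenever an insertion would push load_factor() above max_load_factor() (1.0 by default). A quick way to watch it grow:

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        std::size_t buckets = m.bucket_count();

        for (int i = 0; i < 10000; ++i) {
            m[i] = i;
            if (m.bucket_count() != buckets) {  // a rehash just happened
                std::cout << "rehash at size " << m.size() << ": "
                          << buckets << " -> " << m.bucket_count()
                          << " buckets\n";
                buckets = m.bucket_count();
            }
        }
    }
    ```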

    If you use exponential resizing (e.g. doubling the table), you will typically still have an amortized worst-case step complexity of O(1) for insertion, deletion, and lookup: re-hashing n elements costs O(n), but it only happens after O(n) cheap operations, so the cost averages out to a constant per operation.