Tags: c++, hashset, unordered-set

How to count collisions in unordered_set c++


I want to calculate some stats about my hash function (like the max/avg number of collisions). I wrote a dummy hash function (which maps every key to 1) and expected the max/avg number of collisions to equal the number of keys. But I get the same numbers for both hash functions. Can someone explain this? Code:

#include <iostream>
#include <unordered_set>

struct DummyHash
{
    size_t operator()(int key) const
    {
        return static_cast<size_t>(1);
    }
};

int main()
{
    std::unordered_set<int, DummyHash> a;
    std::unordered_set<int> b;
    int c = 10000;

    for (int i = 0; i < c; i++)
    {
        a.insert(i);
    }
    std::cout << "a ended" << std::endl;

    for (int i = 0; i < c; i++)
    {
        b.insert(i);
    }

    std::cout << "b ended" << std::endl;

    std::cout << "a = " << a.max_load_factor() << ' ' << a.load_factor() << ' '
        << a.max_size() << ' ' << a.max_bucket_count() << ' ' << a.bucket_count() << '\n';
    std::cout << "b = " << b.max_load_factor() << ' ' << b.load_factor() << ' '
        << b.max_size() << ' ' << b.max_bucket_count() << ' ' << b.bucket_count() << '\n';
    return 0;
}

Result:

a ended
b ended
a = 1 0.659065 768614336404564650 768614336404564650 15173
b = 1 0.659065 1152921504606846975 1152921504606846975 15173

Solution

  • std::unordered_set will increase its bucket_count in an attempt to keep load_factor near max_load_factor.

    That means that bucket_count depends only on the number of elements in the set, and is unaffected by the number of collisions.
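
    For instance, here is a minimal standalone sketch (not part of the original answer) that prints the bucket count while inserting through the dummy hash; bucket_count keeps growing so that load_factor stays at or below max_load_factor, even though every element collides:

    #include <iostream>
    #include <unordered_set>

    struct DummyHash
    {
        size_t operator()(int) const { return 1; }   // every key falls into the same bucket
    };

    int main()
    {
        std::unordered_set<int, DummyHash> s;
        for (int i = 0; i < 10000; i++)
        {
            s.insert(i);
            if ((i + 1) % 2000 == 0)   // sample the table a few times as it grows
                std::cout << "size " << s.size()
                    << " -> bucket_count " << s.bucket_count()
                    << ", load_factor " << s.load_factor() << '\n';
        }
    }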

    To count collisions, walk the buckets and, for every non-empty bucket, add bucket_size - 1 (every element beyond the first in a bucket is a collision):

    size_t collisions = 0, empty = 0;
    for (auto bucket = a.bucket_count(); bucket--;) {
        if (a.bucket_size(bucket) == 0)
            empty++;                                  // unused bucket
        else
            collisions += a.bucket_size(bucket) - 1;  // elements beyond the first share a bucket
    }
    std::cout << "a = " << a.max_load_factor() << ' ' << a.load_factor() << ' '
        << ' ' << a.bucket_count() << ' ' << collisions << ' ' << empty << '\n';
    empty = 0, collisions = 0;
    for (auto bucket = b.bucket_count(); bucket--;) {
        if (b.bucket_size(bucket) == 0)
            empty++;
        else
            collisions += b.bucket_size(bucket) - 1;
    }
    std::cout << "b = " << b.max_load_factor() << ' ' << b.load_factor() << ' '
        << ' ' << b.bucket_count() << ' ' << collisions << ' ' << empty << '\n';
    

    Prints

    a = 1 0.610352  16384 9999 16383
    b = 1 0.610352  16384 4773 11157
    

    That is, with a bad hashing function there are 9999 collisions and 16383 out of 16384 empty buckets.
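
    If you also want the max/avg figures asked about in the question, the same bucket interface gives them directly. A short sketch along the same lines, reusing a from the question's main (variable names are just for illustration):

    size_t max_bucket = 0, used_buckets = 0;
    for (size_t i = 0; i < a.bucket_count(); ++i) {
        if (a.bucket_size(i) > max_bucket)
            max_bucket = a.bucket_size(i);           // longest chain = worst-case lookup
        if (a.bucket_size(i) > 0)
            used_buckets++;
    }
    // average chain length over the buckets that actually hold elements
    double avg_bucket = static_cast<double>(a.size()) / used_buckets;
    std::cout << "max bucket size = " << max_bucket
        << ", avg non-empty bucket size = " << avg_bucket << '\n';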


    Unrelated: if you care about hash table performance, have a look at Google's dense_hash_map, which uses open addressing instead of per-bucket chaining for much better cache behavior and performance.