Search code examples
c++hashtablehash-function

Hash Function for a 7 digits int


I'm new to hash tables and functions, so I apologize in advance if I got anything wrong.

I'm trying to create a hash table in C++ for a list of about 100k entries comprised of a 7 digit number. The thing is, I got stuck while trying to figure out what hash function to use.

When using %100000 I got ~65k unique keys, while there are ~90k unique entries. Which means that about 1/3 of the data will have collisions.

Is this a good hash function to use? Or is there a better function to use in that case in order to have less collisions? What size should my table be?

Thanks again!

Edit- The entries are numbers between 1 and 2 mil. Is it possible to use the number itself as the key? Or should keys for hash table always start at 0?


Solution

  • The standard library comes with two types internally using a hash table: std::unordered_map and std::unordered_set.

    As your key type is an integral type you get a hash table pretty conveniently by std::unordered_map<YourIdType, YourDataType. You can easily access the data via theMap[someId], but be aware that if the key is not found a new object is created! If that's not desired you'd rather use theMap.find(someId), which returns an iterator.

    The drawback, though, is that you store the id twice then (internally as a std::pair<YourIdType, YourDataType>). You can avoid that by using a std::unordered_set. To be able to do so, though, you need to specialise std::hash and std::equal_to for your type:

    namespace std // you are not allowed to add to – with exception of specialisations
    {
    template<>
    struct hash<YourDataType>
    {
        size_t operator()(YourDataType const& object) const
        {
            return hash<YourIdType>()(object.id);
        }
    };
    
    // analogously equal_to with appropriate comparisons, possibly by
    // comparing the object's addresses
    

    Alternatively you can provide a custom hasher type (with C++20 that can even be a lambda packed into decltype) to the set as second template parameter and just implement operator== for your object type, or provide a custom equality comparer type if you need it to compare differently than the operator, maybe like:

    // C++20 required:
    using YourMapType = std::set
    <
        YourDataType,
        decltype
        (
            [](YourDataType const& object)
            { return std::hash<YourIdType>()(object.id); }
        ),
        decltype
        (
            [](YourDataType const& o1, YourDataType const& o2)
            { return &o1 == &o2; } // TODO: comparisons as you need! 
        )
    >;
    // alternatively create custom types with appropriate operator() implementations
    

    Drawback here is – apart from additional complexity for the specialisations – that you cannot lookup objects by the id only, instead you need a complete object of your data type.

    So which one is more appropriate/suitable? That depends on your concrete requirements...