Search code examples
c++hashhashtableunordered-maphash-function

Create custom Hash Function


I tried to implement an unordered map for a Class called Pair, that stores an integer and a bitset. Then I found out, that there isn't a hashfunction for this Class. Now I wanted to create my own hashfunction. But instead of using the XOR function or comparable functions, I wanted to have a hashfunction like the following approach:

the bitsets in my class obviously have fixed size, so I wanted to do the following:

example: for a instance of Pair with the bitset<6> = 101101, and the integer 6:

  • create a string = "1011016"
  • and now use the default hashfunction on this string
  • because the bitsets have fixed size, each key would be unique

how could I implement this approach?

thank you in advance


Solution

  • To expand on a comment, as requested:

    Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:

    struct Pair
    {
      std::bitset<6> bits;
      int intval;
    };
    
    template<>
    std::hash<Pair>
    {
      std::size_t operator()(const Pair& pair) const noexcept
      {
         std::size_t rtrn = static_cast<std::size_t>(pair.intval);
         rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong(); 
         return rtrn;
      }
    };
    

    This works on two assumptions:

    1. The upper bits of the integer are generally not interesting
    2. The size of the bitset is always small compared to size_t

    I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.

    Possible improvements

    We can preserve the upper bits by rotating instead of shifting.

    template<>
    std::hash<Pair>
    {
      std::size_t operator()(const Pair& pair) const noexcept
      {
         std::size_t rtrn = static_cast<std::size_t>(pair.intval);
         std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
         std::size_t bitdigits = pair.bits.size();
         // can  be simplified to std::rotl(rtrn, bitdigits) in C++20
         rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
         rtrn ^= pair.bits.to_ulong();
         return rtrn;
      }
    };
    

    Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...

    If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't just xor them to avoid hash collisions on small integers.

    std::hash<Pair>
    {
      std::size_t operator()(const Pair& pair) const noexcept
      {
         std::hash<decltype(pair.intval)> ihash;
         std::hash<decltype(pair.bits)> bhash;
         return ihash(pair.intval) * 31 + bhash(pair.bits);
      }
    };
    

    I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen as it is a prime number one off a power of two. So it can be computed quickly as (x << 5) - x