Search code examples
c++hashc++11user-defined-literals

Compile time consistent hash of char types


I want to implement of sort of Symbol the same way ruby does.

For this, I created a user defined literal which returned a std::hash of the std::basic_string<T> corresponding.

The code was great, but as I read somewhere the hash function may not be consistent over several executions of the same program. Moreover, I wanted to make this computation at compile-time, which was 1) not supported by std::hash and 2) would break code if std::hash return value changes.

So I wrote up the following implementation, based on java.lang.String.hashCode implementation.

typedef size_t symbol;

template<typename CharT>
constexpr size_t constant_hash(const CharT* p, size_t h = 0) noexcept
{
    return (*p == 0) ? h : constant_hash(p + 1, h * 31 + static_cast<size_t>(*p));
}

constexpr symbol operator "" _sym (const char* p, size_t n) noexcept
{
    return constant_hash(p);
}

My question is: are there any problem with this implementation?

I'm only able to test it on GCC 4.7.1, and I'd like to know if it is standard-compliant and should also work on other compiler.

I'm asking this because a previous implementation was working on GCC but caused segfault if the binary was compiled with clang++ (problem of undefined behavior with increment operators I think).

Thanks in advance

Edit

Work with clang++ (thanks KennyTM)


Solution

  • There's no UB, it works fine, as long as the string has '\0' termination. Note that constexpr evaluation cannot invoke UB; arithmetic or pointer operations that cause UB at runtime are required to produce a compile error in the context of a constant-expression.

    Note that the static_cast is unnecessary; the char operand will be promoted to size_t.

    Also, on first glance the hash function doesn't look very good because h * 31 is just ( h << 5 ) - h. You might pick a larger number with 1's randomly spread out across entire the binary value. But on the other hand, they might be trying to be clever since the low 5 bits of ASCII has the most entropy and this eliminates the possibility of collisions between short strings of different length.