Search code examples
c++boosthashhashcode

Hashing string values results in different outputs on Linux vs Windows in C++


I am trying to hash single std::string and std::vector<std::string> values. I am following the examples from cppreference and boost. When the code is compiled and run on Windows vs Linux, I get different results.

The header file for MyHasher.h is as follows.

class MyHasher {
 private:
  MyHasher() = delete;
 public:
  static std::size_t hash(std::vector<std::string> ids);
  static std::size_t hash(std::string s);
  static void hashCombine(std::size_t &seed, std::size_t value);
};

The CPP file MyHasher.cpp is as follows.

std::size_t MyHasher::hash(std::vector<std::string> ids) {
  std::size_t seed = 0;
  for (auto id : ids) {
    std::size_t h = std::hash<std::string>{}(id);
    hashCombine(seed, h);
  }
  return seed;
}
std::size_t MyHasher::hash(std::string s) {
  std::size_t seed = 0;
  std::size_t h = std::hash<std::string>{}(s);
  hashCombine(seed, h);
  return seed;
}
void MyHasher::hashCombine(std::size_t &seed, std::size_t value) {
  seed ^= value + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

A sample program then runs as follows.

int main() {
  std::cout << std::to_string(MyHasher::hash("0")) << " | 0" << std::endl;
  std::cout << std::to_string(MyHasher::hash(std::vector<std::string>{"0"})) << " | 0" << std::endl;
  std::cout << std::to_string(MyHasher::hash(std::vector<std::string>{"0", "1"})) << " | 0 1" << std::endl;
  return 0;
}

On Linux (g++ 7.4.0), the output is as follows.

2297668036269395695 | 0
2297668036269395695 | 0
10545066640295778616 | 0 1

On Windows (Visual Studio Community 2019, MSVC-14.0), the output is as follows.

12638135526163551848 | 0
12638135526163551848 | 0
1964774108746342951 | 0 1

Any ideas on this discrepancy?

What I really want is a way to always produce an unique hash output that is dependent on the input, but cross-platform and fixed width. The width is not important, per say, but as long as it is the same width regardless of the input(s).


Solution

  • In the doc. of std::hash, it is explicitly mentioned that:

    The actual hash functions are implementation-dependent

    Hash functions are only required to produce the same result for the same input within a single execution of a program;

    I'm a bit uncertain about a hash function which always returns identical hashs for identical input. I googled a bit but failed to find something which I would dare to present.

    Assuming that the std library of MS VC++ and g++ may be different implementations, it cannot be expected to produce identical hashs for identical input.

    Reading the second part of cite carefully, you even cannot expect that the same program results in identical hashs for identical input in distinct processes (e.g. when started, exited, and started again).


    Cryptographic hash functions might be a solution:

    • it is deterministic, meaning that the same message always results in the same hash
    • it is quick to compute the hash value for any given message
    • it is infeasible to generate a message that yields a given hash value
    • it is infeasible to find two different messages with the same hash value a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)

    Checksums are related to hash functions. For a checksum, it must be required to result in identical output for identical input (to be reliable).

    So, a checksum implementation based on a hash function should match the requirement of OP as well.

    The accepted answer to SSE: Which hashing algorithm shoud I use for a safe file checksum? recommends SHA256 or SHA512.

    This remembered me that I recently heard that git uses a variation of SHA-1 but git may use SHA256 as well as well which appears to me as similar use case to what (I assume) OP might have.