c++ boost hash

Faster computation of hash value for a large vector of doubles


I need to calculate the hash value for a large vector. Currently I do it with a for-loop and boost::hash_combine as shown below, but it is too slow: it takes close to 10 ms for a vector of 500000 elements, and ideally I'd like to bring it down to 1 ms or less. Is there a way to compute the hash faster (in one shot, perhaps?) for the contents of a contiguous block of memory like a vector, instead of having to iterate over the entire vector with a for-loop?

#include <random>
#include <algorithm>
#include <functional> // bind
#include <iterator>
#include <iostream>
#include <vector>
#include <chrono>
#include <boost/functional/hash.hpp>

using namespace std;

int main () 
{

    vector<double> myContainer(500000, 0.0);
    
    uniform_real_distribution<double> unif(0.0,1.0);
    mt19937 re(std::random_device{}());
    auto generator = std::bind(unif, std::ref(re));
    generate(begin(myContainer), end(myContainer), generator);
    
    cout << "myContainer[0] = " << myContainer[0] << ", myContainer[L-1] = " << myContainer[myContainer.size()-1] << std::endl;
    
    size_t      hashValBoost(0);     // type supported by boost::hash_combine
    uint64_t startTime_us = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    for(size_t i=0; i<myContainer.size(); ++i)
    {
        boost::hash_combine(hashValBoost, myContainer[i]);
    }
    uint64_t endTime_us = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();

    cout << "ContainerSize = " << myContainer.size() <<"; Container hash = " << hashValBoost << ", TimeToComputeHash(ms) = " << (endTime_us - startTime_us)/1000.0 << "ms" << std::endl;
    return 0;
}

EDIT: I am constrained to build without optimizations turned on and with a C++11 or older compiler.

Run code here: https://wandbox.org/permlink/gxVxZ8QE53DhtZde


Solution

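  • If your vector does not contain NaNs or negative zeroes, you can leverage more heavily optimized byte array hashing algorithms. The restriction matters because a byte-wise hash sees 0.0 and -0.0 as different even though they compare equal, and NaN has many possible bit patterns: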

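    #include <cstddef>      // std::size_t, std::byte
    #include <functional>   // std::hash
    #include <span>         // std::span, std::as_bytes (C++20)
    #include <string_view>  // std::string_view (C++17)

    std::size_t hash_bytes(std::span<const std::byte> sp) {
        // View the raw bytes as characters so the standard string hash can be reused.
        std::string_view sv(reinterpret_cast<const char*>(sp.data()), sp.size());
        return std::hash<std::string_view>{}(sv);

        // Or use some other byte-based hashing algorithm instead, e.g.
        // boost::hash_range, which is optimized for std::byte and unsigned char:
        // return boost::hash_range(sp.begin(), sp.end());
    }

    // Usage: hashVal = hash_bytes(std::as_bytes(std::span(myContainer)));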
    

    Boost also provides boost::hash_value(const std::vector<T>&) and boost::hash_range(It, It), both of which run the hash_combine loop internally. That approach is about 4x slower, but it still works if the vector contains negative zeroes or NaNs.
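
    Since the question is limited to C++11, where std::span and std::string_view are unavailable, the same bit-pattern idea can be sketched without them. The following is only an illustration under stated assumptions, not anything provided by Boost or the standard library: hash_doubles_bitwise is a hypothetical helper that mixes each double's 64-bit pattern using the FNV-1a constants, one 8-byte word at a time. Standard FNV-1a works byte by byte, so this variant does not produce standard FNV-1a values. It assumes an 8-byte double, and like any bit-pattern hash it treats 0.0 and -0.0 as different values.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not a Boost or standard library function).
// Hashes the bit patterns of the doubles, so it is only suitable when the
// data contains no NaNs or negative zeroes.
std::uint64_t hash_doubles_bitwise(const std::vector<double>& v)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes an 8-byte double");

    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        std::uint64_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);  // well-defined way to read the bit pattern
        h ^= bits;                               // mix in one whole word (word-wise variant)
        h *= 1099511628211ULL;                   // FNV 64-bit prime
    }
    return h;
}
```

    Usage would be along the lines of uint64_t h = hash_doubles_bitwise(myContainer);. This has not been timed with optimizations turned off, so it should be measured before relying on it for the 1 ms target.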