I'd really like to know which hash function is used for feature hashing in Vowpal Wabbit.
I know that the underlying algorithm is Murmurhash 3, but I couldn't make out the details by looking at the VW code on github.
Does anyone know the exactly which hash function is used in VW?
The core hash function is the 32-bit Murmur Hash v 3.0 by Austin Appleby.
However, it is incorporated with a slight API change from the original as uniform_hash
to adapt it to varying usage scenarios in vw
.
You may look at the source at:
hashstrings
and hashall
functions)As you can see, things are not so simple when you dig into details. The reasons it isn't simple, is that in some places additional effort was made to improve dispersion when features and name-spaces are combined in interactions and some of the reduction algorithms.
Here's a (possibly not 100% complete) list of the details:
-b bits
, default 18) in order to fit into the weight-vector so the value you get from the hash can be smaller than the straight/naive hash.--hash strings
), feature names that start with a digit are initially used as-is (no hash) but if they continue with some non-digit, the current value computed so far is used as a seed, and the rest of the name is murmur-32-hashed--redefine
, -q
, --cubic
, etc.) are used the two hash results are combined with a different simpler hash.--search
, a specific (non-zero) seed is used when feeding murmur-hash (see: vowpalwabbit/search.cc
) so you may get a different hash value than what's expected.In case of doubt, you may use the --audit
option and vw
will output the exact hash-value of each feature in each example. The format is (example):
#
# UserJack1^mean_karma:3864409:0.12345:0.919323[@3.8964]
# ^^^^+^^^^ ^^^^^+^^^^ ^^^+^^^ ^^^+^^^ ^^^^+^^^ ^^^+^^^
# | | | | | |
# namespace featurename hashval value weight Sum(gradients)
#