Wanting to understand "the hashing trick" I've written the following test code:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h = FeatureHasher(n_features=4, input_type='string')
f = h.transform(test.type)
print(f.toarray())
In the above example, I'm mapping 8 categories into 4 columns, and the output is:
[[ 0. 0. 1. 0.]<-a
[ 0. -1. 0. 0.]<-b
[ 0. -1. 0. 0.]<-c
[ 0. 0. 0. 1.]<-d
[ 0. 0. 0. 1.]<-e
[ 0. 0. 0. 1.]<-f
[ 0. 0. -1. 0.]<-g
[ 0. -1. 0. 0.]]<-g
In the resulting matrix, I can see repetitions and some categories are represented the same way. Why is that? 8 categories can be mapped into 4 columns if I use a binary representation.
Can someone please explain the output of this technique and maybe elaborate a bit?
A FeatureHasher
will lead to undesired results if you set n_features
to such a low value. The reason for this is the way in which it maps categories to column indices.
As opposed to a CountVectorizer
for instance, where each category is assigned a unique integer index corresponding to a column just by order of occurrence, FeatureHasher
will apply a hash function to the features to determine the column index of each category. It's main advantage is hence an increased speed. However by limiting n_features
to such a low value, it is likely that the result of hashing a given category will result in an index higher than the set n_features
, and consequently what you'll get is a truncated feature vector.
We can actually check this by reproducing how the hashing is done in
_hashing_fast
which uses murmurhash3_bytes_s32
to generate the hashes:
from sklearn.utils.murmurhash import murmurhash3_bytes_s32
raw_X = test['type']
raw_X = iter(raw_X)
raw_X = (((f, 1) for f in x) for x in raw_X)
for x in raw_X:
for f, v in x:
f = f'{f}={v}'
fb = (f).encode("utf-8")
h = murmurhash3_bytes_s32(fb, seed=0)
print(f'{f[0]} -> {h}')
Which as you can see yields larger hash values precisely for e
and f
, which are truncated to the lower hash corresponding to d
:
a -> -424864564
b -> -992685778
c -> -1984769100
d -> 728527081
e -> 2077529484
f -> 2074045163
g -> -1877798433
h -> -51608576