python machine-learning scikit-learn feature-engineering

Hash trick in sklearn FeatureHasher

Wanting to understand "the hashing trick" I've written the following test code:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h = FeatureHasher(n_features=4, input_type='string')
f = h.transform(test.type)
print(f.toarray())

In the above example, I'm mapping 8 categories into 4 columns, and the output is:

[[ 0.  0.  1.  0.]<-a
 [ 0. -1.  0.  0.]<-b
 [ 0. -1.  0.  0.]<-c
 [ 0.  0.  0.  1.]<-d
 [ 0.  0.  0.  1.]<-e
 [ 0.  0.  0.  1.]<-f
 [ 0.  0. -1.  0.]<-g
 [ 0. -1.  0.  0.]]<-g

In the resulting matrix, I can see repetitions and some categories are represented the same way. Why is that? 8 categories can be mapped into 4 columns if I use a binary representation.

Can someone please explain the output of this technique and maybe elaborate a bit?

Solution

A FeatureHasher will lead to undesired results if you set n_features to such a low value. The reason for this is the way in which it maps categories to column indices.

As opposed to a CountVectorizer for instance, where each category is assigned a unique integer index corresponding to a column just by order of occurrence, FeatureHasher will apply a hash function to the features to determine the column index of each category. It's main advantage is hence an increased speed. However by limiting n_features to such a low value, it is likely that the result of hashing a given category will result in an index higher than the set n_features, and consequently what you'll get is a truncated feature vector.

We can actually check this by reproducing how the hashing is done in _hashing_fast which uses murmurhash3_bytes_s32 to generate the hashes:

from sklearn.utils.murmurhash import murmurhash3_bytes_s32

raw_X = test['type']
raw_X = iter(raw_X)
raw_X = (((f, 1) for f in x) for x in raw_X)

for x in raw_X:
    for f, v in x:
        f = f'{f}={v}'
        fb = (f).encode("utf-8")
        h = murmurhash3_bytes_s32(fb, seed=0)
        print(f'{f[0]} -> {h}')

Which as you can see yields larger hash values precisely for e and f, which are truncated to the lower hash corresponding to d:

a -> -424864564
b -> -992685778
c -> -1984769100
d -> 728527081
e -> 2077529484
f -> 2074045163
g -> -1877798433
h -> -51608576