Search code examples
pythonmachine-learningscikit-learnfeature-engineering

Hash trick in sklearn FeatureHasher


Wanting to understand "the hashing trick" I've written the following test code:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h = FeatureHasher(n_features=4, input_type='string')
f = h.transform(test.type)
print(f.toarray())

In the above example, I'm mapping 8 categories into 4 columns, and the output is:

[[ 0.  0.  1.  0.]<-a
 [ 0. -1.  0.  0.]<-b
 [ 0. -1.  0.  0.]<-c
 [ 0.  0.  0.  1.]<-d
 [ 0.  0.  0.  1.]<-e
 [ 0.  0.  0.  1.]<-f
 [ 0.  0. -1.  0.]<-g
 [ 0. -1.  0.  0.]]<-g

In the resulting matrix, I can see repetitions and some categories are represented the same way. Why is that? 8 categories can be mapped into 4 columns if I use a binary representation.

Can someone please explain the output of this technique and maybe elaborate a bit?


Solution

  • A FeatureHasher will lead to undesired results if you set n_features to such a low value. The reason for this is the way in which it maps categories to column indices.

    As opposed to a CountVectorizer for instance, where each category is assigned a unique integer index corresponding to a column just by order of occurrence, FeatureHasher will apply a hash function to the features to determine the column index of each category. It's main advantage is hence an increased speed. However by limiting n_features to such a low value, it is likely that the result of hashing a given category will result in an index higher than the set n_features, and consequently what you'll get is a truncated feature vector.


    We can actually check this by reproducing how the hashing is done in _hashing_fast which uses murmurhash3_bytes_s32 to generate the hashes:

    from sklearn.utils.murmurhash import murmurhash3_bytes_s32
    
    raw_X = test['type']
    raw_X = iter(raw_X)
    raw_X = (((f, 1) for f in x) for x in raw_X)
    
    for x in raw_X:
        for f, v in x:
            f = f'{f}={v}'
            fb = (f).encode("utf-8")
            h = murmurhash3_bytes_s32(fb, seed=0)
            print(f'{f[0]} -> {h}')
    

    Which as you can see yields larger hash values precisely for e and f, which are truncated to the lower hash corresponding to d:

    a -> -424864564
    b -> -992685778
    c -> -1984769100
    d -> 728527081
    e -> 2077529484
    f -> 2074045163
    g -> -1877798433
    h -> -51608576