python pandas dataframe scikit-learn feature-extraction

Feature hashing on multiple categorical features (columns)


I would like to hash the feature ‘Genre’ into six columns and, separately, the feature ‘Publisher’ into another six columns. I want something like this:

      Genre      Publisher  0    1    2    3    4    5      0    1    2    3    4    5 
0     Platform  Nintendo  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0
1       Racing      Noir -1.0  0.0  0.0  0.0  0.0 -1.0   -1.0  0.0  0.0  0.0  0.0 -1.0
2       Sports     Laura -2.0  2.0  0.0 -2.0  0.0  0.0   -2.0  2.0  0.0 -2.0  0.0  0.0
3  Roleplaying      John -2.0  2.0  2.0  0.0  1.0  0.0   -2.0  2.0  2.0  0.0  1.0  0.0
4       Puzzle      John  0.0  1.0  1.0 -2.0  1.0 -1.0    0.0  1.0  1.0 -2.0  1.0 -1.0
5     Platform      Noir  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0

The following code does what I want:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

d = {'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
     'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']}
df = pd.DataFrame(data=d)

# One hasher per column, six output columns each
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')
hashed_features1 = fh1.fit_transform(df['Genre']).toarray()
hashed_features2 = fh2.fit_transform(df['Publisher']).toarray()

# Put the original columns and both hashed blocks side by side
pd.concat([df[['Genre', 'Publisher']], pd.DataFrame(hashed_features1), pd.DataFrame(hashed_features2)],
          axis=1)

This works for the two features above, but if I have, say, 40 categorical features, this approach becomes tedious. Is there another way to do it?


Solution

    Hashing (Update)

    Assuming that new categories might show up in some of the features, hashing is the way to go. Two notes:

    • Be aware of possible collisions (distinct categories landing in the same column) and adjust the number of features accordingly
    • In your case, you want to hash each feature separately
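
As a rough sketch of both notes, the loop below hashes every column with its own FeatureHasher and reports how many categories collide for a given size. The helper names `hash_columns` and `count_collisions` are made up for this illustration and are not part of scikit-learn. Each value is wrapped in a one-element list so that the whole category string is hashed as a single token, which should also work on scikit-learn releases that refuse bare strings as samples.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher


def hash_columns(df, columns, n_features=6):
    """Hash each listed column into its own block of n_features columns."""
    blocks = []
    for col in columns:
        # Wrap every value in a one-element list so the whole string is one token.
        samples = [[value] for value in df[col].astype(str)]
        hasher = FeatureHasher(n_features=n_features, input_type='string')
        hashed = hasher.fit_transform(samples).toarray()
        names = [f'{col}_{i}' for i in range(n_features)]
        blocks.append(pd.DataFrame(hashed, columns=names, index=df.index))
    return pd.concat(blocks, axis=1)


def count_collisions(values, n_features):
    """Distinct categories minus occupied hash buckets; 0 means no collision."""
    distinct = sorted(set(map(str, values)))
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    hashed = hasher.fit_transform([[v] for v in distinct]).toarray()
    occupied = (hashed != 0).any(axis=0).sum()
    return len(distinct) - int(occupied)


df = pd.DataFrame({
    'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
    'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir'],
})

# Six hashed columns per original column, named Genre_0 ... Publisher_5
hashed_df = hash_columns(df, df.columns, n_features=6)
result = pd.concat([df, hashed_df], axis=1)

# Collision count per column for the chosen hash size
collisions = {col: count_collisions(df[col], n_features=6) for col in df.columns}
```

With 40 categorical columns the call stays the same, since only the list of column names grows. If `collisions` reports large counts for some column, increasing `n_features` is the usual remedy, at the cost of a wider result.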

    One Hot Vector

    If the number of categories for each feature is fixed and not too large, use one-hot encoding.

    I would recommend using either of the two:

    1. sklearn.preprocessing.OneHotEncoder
    2. pandas.get_dummies
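
To illustrate why the "fixed set of categories" condition matters, here is a minimal sketch of what happens when a category appears that was not present at fit time. The `train` and `new` frames are invented for this example. `handle_unknown='ignore'` is an existing OneHotEncoder option that encodes an unseen category as an all-zero row, while `pd.get_dummies` simply builds columns from whatever values it is given.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A']})
new = pd.DataFrame({'feature_1': ['A', 'C']})  # 'C' was never seen during fit

# The encoder remembers the categories A, G, T from the training data
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
encoded_new = enc.transform(new).toarray()  # row for 'C' contains only zeros

# get_dummies keeps no state, so the columns depend on the data passed in
dummy_columns = list(pd.get_dummies(new).columns)
```

Without `handle_unknown='ignore'` the encoder raises an error on the unseen value, and the `get_dummies` columns for `new` no longer line up with those for `train`. Hashing sidesteps both problems because the output width is fixed in advance.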

    Example

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.preprocessing import OneHotEncoder
    
    df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                       'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
    
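    # Approach 0 (Hashing per feature)
    # Each column reaches its hasher as a 1-D Series of strings, so every value is
    # hashed character by character; recent scikit-learn releases reject bare strings
    # as samples, and there each value has to be wrapped in a list first.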
    n_orig_features = df.shape[1]
    hash_vector_size = 6
    ct = ColumnTransformer([(f't_{i}', FeatureHasher(n_features=hash_vector_size, 
                            input_type='string'), i) for i in range(n_orig_features)])
    
    res_0 = ct.fit_transform(df)  # res_0.shape[1] = n_orig_features * hash_vector_size
    
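    # Approach 1 (OHV)
    # dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
    res_1 = pd.get_dummies(df, dtype=int)
    
    # Approach 2 (OHV)
    # .toarray() gives a dense array without the version-dependent sparse / sparse_output argument
    res_2 = OneHotEncoder().fit_transform(df).toarray()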
    

    res_0 :

    array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1., -1.,  0., -1.],
           [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  2., -1.,  0.,  0.,  0.],
           [ 0., -1.,  0.,  0.,  0.,  0., -2.,  2.,  2., -1.,  0., -1.],
           [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  2.,  1., -1.,  0., -1.]])
    

    res_1 :

       feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
    0            1            0            0              1              0                   0                0
    1            0            1            0              0              1                   0                0
    2            0            0            1              0              0                   1                0
    3            1            0            0              0              0                   0                1
    

    res_2 :

    array([[1., 0., 0., 1., 0., 0., 0.],
           [0., 1., 0., 0., 1., 0., 0.],
           [0., 0., 1., 0., 0., 1., 0.],
           [1., 0., 0., 0., 0., 0., 1.]])