I would like to hash feature ‘Genre’ into 6 columns and separately feature ‘Publisher’ into another six columns. I want something like below:
Genre Publisher 0 1 2 3 4 5 0 1 2 3 4 5
0 Platform Nintendo 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
1 Racing Noir -1.0 0.0 0.0 0.0 0.0 -1.0 -1.0 0.0 0.0 0.0 0.0 -1.0
2 Sports Laura -2.0 2.0 0.0 -2.0 0.0 0.0 -2.0 2.0 0.0 -2.0 0.0 0.0
3 Roleplaying John -2.0 2.0 2.0 0.0 1.0 0.0 -2.0 2.0 2.0 0.0 1.0 0.0
4 Puzzle John 0.0 1.0 1.0 -2.0 1.0 -1.0 0.0 1.0 1.0 -2.0 1.0 -1.0
5 Platform Noir 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
The following code does what I want to do
import pandas as pd
d = {'Genre': ['Platform', 'Racing','Sports','Roleplaying','Puzzle','Platform'], 'Publisher': ['Nintendo', 'Noir','Laura','John','John','Noir']}
df = pd.DataFrame(data=d)
from sklearn.feature_extraction import FeatureHasher
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')
hashed_features1 = fh.fit_transform(df['Genre'])
hashed_features2 = fh.fit_transform(df['Publisher'])
hashed_features1 = hashed_features1.toarray()
hashed_features2 = hashed_features2.toarray()
pd.concat([df[['Genre', 'Publisher']], pd.DataFrame(hashed_features1),pd.DataFrame(hashed_features2)],
axis=1)
This works for the above two feature but If I have lets say 40 categorical features then this approach would be tedious. Is there any other way to do?
Hashing (Update)
Assuming that new categories might show up in some of the features, hashing is the way to go. Just 2 notes:
One Hot Vector
In case the number of categories for each feature is fixed and not too large, use one hot encoding.
I would recommend using either of the two:
sklearn.preprocessing.OneHotEncoder
pandas.get_dummies
Example
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
# Approach 0 (Hashing per feature)
n_orig_features = df.shape[1]
hash_vector_size = 6
ct = ColumnTransformer([(f't_{i}', FeatureHasher(n_features=hash_vector_size,
input_type='string'), i) for i in range(n_orig_features)])
res_0 = ct.fit_transform(df) # res_0.shape[1] = n_orig_features * hash_vector_size
# Approach 1 (OHV)
res_1 = pd.get_dummies(df)
# Approach 2 (OHV)
res_2 = OneHotEncoder(sparse=False).fit_transform(df)
res_0
:
array([[ 0., 0., 0., 0., 1., 0., 0., 0., 1., -1., 0., -1.],
[ 0., 0., 0., 1., 0., 0., 0., 2., -1., 0., 0., 0.],
[ 0., -1., 0., 0., 0., 0., -2., 2., 2., -1., 0., -1.],
[ 0., 0., 0., 0., 1., 0., 0., 2., 1., -1., 0., -1.]])
res_1
:
feature_1_A feature_1_G feature_1_T feature_2_cat feature_2_dog feature_2_elephant feature_2_zebra
0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0
2 0 0 1 0 0 1 0
3 1 0 0 0 0 0 1
res_2
:
array([[1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 1.]])