I have multiple columns of categorical variables stored as integer values ranging from 0-4, and all columns share the same set of categories. I tried using OneHotEncoder from scikit-learn, but it only creates output columns for the categories that actually occur in each input column. That would cause problems when I run unseen data through my neural network model. The code below shows the kind of data I need to encode:
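>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> df = pd.DataFrame(np.random.randint(low=0, high=4, size=(5, 5)),
...                   columns=['color1', 'color2', 'color3', 'color4', 'color5'])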
>>> df
   color1  color2  color3  color4  color5
0       0       1       2       3       1
1       3       1       0       1       1
2       0       1       0       3       0
3       0       2       0       1       2
4       0       2       0       3       2
>>> df_onehotencoder = OneHotEncoder(sparse=False)
>>> df2 = df_onehotencoder.fit_transform(df)
>>> df2
array([[1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0.],
[0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0.],
[1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
[1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1.],
[1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1.]])
This produces an array in which each column is encoded only for the categories present in that column, not for the missing ones. I need the same number of encoded columns for every input column, i.e. a missing category should simply be all zeroes. Also, what would be the best way to decode this one-hot encoded array, so that I can easily turn the predicted output back into the actual integer values?
Starting from sklearn==0.20, OneHotEncoder has a categories parameter, where you can provide a list of lists with all the possible values for each column.
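import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([[0, 1, 2, 3, 1],
                   [3, 1, 0, 1, 1],
                   [0, 1, 0, 3, 0],
                   [0, 2, 0, 1, 2],
                   [0, 2, 0, 3, 2]],
                  columns=['color1', 'color2', 'color3', 'color4', 'color5'])

# Get all the unique values if we don't have them.
# np.unique returns them sorted, which OneHotEncoder requires for numeric categories.
unique_values = np.unique(df.values)

# Use the same category list for every column.
# On scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False
# and call get_feature_names_out instead of get_feature_names.
ohe = OneHotEncoder(categories=[unique_values] * df.shape[1], sparse=False)
encoded = pd.DataFrame(ohe.fit_transform(df),
                       columns=ohe.get_feature_names(df.columns))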
>>> encoded
color1_0 color1_1 color1_2 color1_3 color2_0 color2_1 ...
0 1.0 0.0 0.0 0.0 0.0 1.0 ...
1 0.0 0.0 0.0 1.0 0.0 1.0 ...
2 1.0 0.0 0.0 0.0 0.0 1.0 ...
3 1.0 0.0 0.0 0.0 0.0 0.0 ...
4 1.0 0.0 0.0 0.0 0.0 0.0 ...
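To make the layout of this output concrete: with a shared category list, every original column expands to the same number of indicator columns, and a category that never occurs in a column just yields an all-zero indicator column. Here is a rough NumPy-only sketch of that idea, assuming the categories are the integers 0 to 3. The helper name one_hot_fixed and the small data array are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def one_hot_fixed(values, n_categories):
    """Hypothetical helper: one-hot encode integer codes 0..n_categories-1,
    giving every input column exactly n_categories output columns."""
    values = np.asarray(values)
    # Row i of the identity matrix is the indicator vector for category i.
    blocks = np.eye(n_categories)[values]  # shape: (rows, columns, n_categories)
    # Lay the per-column blocks side by side, one block per original column.
    return blocks.reshape(values.shape[0], -1)

# Made-up data: 2 rows, 2 columns; category 2 never occurs.
data = [[0, 1],
        [3, 1]]
encoded_demo = one_hot_fixed(data, n_categories=4)
print(encoded_demo.shape)
print(encoded_demo)
```

Each row has 8 entries here (2 columns times 4 categories), regardless of which categories happen to appear in the data.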
To get back the original classes, you can use inverse_transform:
>>> ohe.inverse_transform(encoded)
array([[0, 1, 2, 3, 1],
[3, 1, 0, 1, 1],
[0, 1, 0, 3, 0],
[0, 2, 0, 1, 2],
[0, 2, 0, 3, 2]], dtype=int64)
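If the array you want to decode comes from the network instead of the encoder, its entries will usually be scores or probabilities rather than exact 0/1 values. One way to handle that is to take the argmax inside each column's block yourself. The following is a minimal sketch, assuming every original column uses the same sorted category list and that the network output has the same column layout as the encoded array. The function decode_blocks and the scores array are hypothetical, made up for illustration.

```python
import numpy as np

def decode_blocks(scores, categories, n_columns):
    """Hypothetical helper: decode block-wise one-hot (or score) rows.

    Assumes each original column occupies a contiguous block of
    len(categories) entries, in the same category order for every column.
    """
    scores = np.asarray(scores, dtype=float)
    categories = np.asarray(categories)
    n_categories = len(categories)
    if scores.shape[1] != n_columns * n_categories:
        raise ValueError("unexpected number of encoded columns")
    # Reshape to (rows, original columns, categories), then pick the
    # highest-scoring category within each block.
    blocks = scores.reshape(scores.shape[0], n_columns, n_categories)
    return categories[blocks.argmax(axis=2)]

# Made-up scores for 2 rows, 2 original columns and categories [0, 1, 2, 3].
scores = [[0.7, 0.1, 0.1, 0.1, 0.2, 0.6, 0.1, 0.1],
          [0.0, 0.1, 0.1, 0.8, 0.1, 0.2, 0.6, 0.1]]
decoded = decode_blocks(scores, categories=[0, 1, 2, 3], n_columns=2)
print(decoded)
```

With exact one-hot rows this gives the same integers that went into the encoder, so it can be checked against the inverse_transform result above.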