python, scikit-learn, categorical-data, one-hot-encoding, categorization

OneHotEncoder on multiple columns belonging to same categories


I have multiple columns of categorical variables stored as integer values ranging from 0 to 3, and every column draws from the same set of categories. I tried scikit-learn's OneHotEncoder, but it only creates output columns for the categories actually present in each input column, which would cause problems when I run unseen data through my neural network model. The code below shows the kind of data I need to encode:

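>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> df = pd.DataFrame(np.random.randint(low=0, high=4, size=(5, 5)),
...                   columns=['color1', 'color2', 'color3', 'color4', 'color5'])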
>>> df

   color1  color2  color3  color4  color5
0       0       1       2       3       1
1       3       1       0       1       1
2       0       1       0       3       0
3       0       2       0       1       2
4       0       2       0       3       2

>>> df_onehotencoder = OneHotEncoder(sparse=False)  # sparse_output=False on scikit-learn >= 1.2
>>> df2 = df_onehotencoder.fit_transform(df)

>>> df2

array([[1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0.],
       [1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1.]])

This produces a block of encoded columns for each input column, but only for the categories present in that column, not for the missing ones. I need the same number of encoded columns for every input column, so that a missing category simply shows up as a column of zeros. Also, what would be the best way to decode this one-hot encoded array, so that I can easily turn the predicted output back into the actual integer values?
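To make the behaviour described above concrete, here is a small sketch that inspects the encoder's fitted `categories_` attribute. It hard-codes the sample frame shown earlier (the original is random, so this fixed copy is an assumption made for reproducibility) and leaves the encoder on its defaults.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fixed copy of the sample frame printed above (assumed for reproducibility).
df = pd.DataFrame([[0, 1, 2, 3, 1],
                   [3, 1, 0, 1, 1],
                   [0, 1, 0, 3, 0],
                   [0, 2, 0, 1, 2],
                   [0, 2, 0, 3, 2]],
                  columns=['color1', 'color2', 'color3', 'color4', 'color5'])

# With the default categories='auto', the categories of each column are
# inferred from the values seen in that column during fit.
enc = OneHotEncoder()
enc.fit(df)

# categories_ holds one array of learned categories per input column.
per_column = {col: cats.tolist() for col, cats in zip(df.columns, enc.categories_)}
widths = [len(cats) for cats in enc.categories_]

print(per_column)
print(widths, sum(widths))
```

The widths differ from column to column and add up to the 11 columns of the array above, which is why the layout changes whenever a column happens to miss a category.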


Solution

  • Starting with scikit-learn 0.20, OneHotEncoder has a categories parameter that accepts a list of lists, one per column, holding all the possible values for that column.

    import pandas as pd
    df = pd.DataFrame([[0, 1, 2, 3, 1],
     [3, 1, 0, 1, 1],
     [0, 1, 0, 3, 0],
     [0, 2, 0, 1, 2],
     [0, 2, 0, 3, 2]], columns=['color1', 'color2', 'color3', 'color4', 'color5'])
    
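    from sklearn.preprocessing import OneHotEncoder
    
    # Collect every value that occurs anywhere in the frame if the full set
    # is not known up front. Numeric categories must be passed in sorted order.
    unique_values = sorted(pd.unique(df.values.ravel()))
    
    # Use the same category list for every column so each one gets the same slots.
    # Older releases spell these sparse=False (before 1.2) and
    # get_feature_names (before 1.0).
    ohe = OneHotEncoder(categories=[unique_values] * df.shape[1], sparse_output=False)
    encoded = pd.DataFrame(ohe.fit_transform(df),
                           columns=ohe.get_feature_names_out(df.columns))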
    >>> encoded
    
       color1_0  color1_1  color1_2  color1_3  color2_0  color2_1  ...
    0       1.0       0.0       0.0       0.0       0.0       1.0  ...
    1       0.0       0.0       0.0       1.0       0.0       1.0  ...
    2       1.0       0.0       0.0       0.0       0.0       1.0  ...
    3       1.0       0.0       0.0       0.0       0.0       0.0  ...
    4       1.0       0.0       0.0       0.0       0.0       0.0  ...
    

    To get back the original classes, call inverse_transform:

    >>> ohe.inverse_transform(encoded) 
    array([[0, 1, 2, 3, 1],
           [3, 1, 0, 1, 1],
           [0, 1, 0, 3, 0],
           [0, 2, 0, 1, 2],
           [0, 2, 0, 3, 2]], dtype=int64)
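For the decoding part of the question, the following is a minimal sketch rather than a definitive recipe. It assumes the full category set is 0 to 3, that the network emits one score per (column, category) slot in the same order as the encoder's output, and it uses random numbers as a stand-in for real predictions. The names `scores`, `best_slot` and `decoded` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.arange(4)   # assumed full category set: 0, 1, 2, 3
n_columns = 5

# Explicit categories fix the output layout, whatever values the data contains.
ohe = OneHotEncoder(categories=[categories] * n_columns)
train = np.array([[0, 1, 2, 3, 1],
                  [3, 1, 0, 1, 1]])
ohe.fit(train)

# Unseen data that uses only one category still gets 4 slots per column.
unseen = np.array([[2, 2, 2, 2, 2]])
encoded_unseen = ohe.transform(unseen).toarray()

# Stand-in for network output: one score per (column, category) slot.
rng = np.random.default_rng(0)
scores = rng.random((3, n_columns * len(categories)))

# Group the scores into 4 slots per input column, take the highest-scoring
# slot, and map the slot index back to the category value.
best_slot = scores.reshape(len(scores), n_columns, len(categories)).argmax(axis=2)
decoded = categories[best_slot]

# Equivalent route: build a hard one-hot array and let the encoder invert it.
one_hot = np.eye(len(categories))[best_slot].reshape(len(scores), -1)
round_trip = ohe.inverse_transform(one_hot)

print(encoded_unseen.shape)
print(decoded)
```

Both routes should give the same integers. The manual argmax avoids a round trip through the encoder, while inverse_transform keeps the mapping in one place if the category values are not simply 0 to 3.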