python, scikit-learn, one-hot-encoding

Encoding with OneHotEncoder


I'm trying to preprocess data with scikit-learn's OneHotEncoder. Obviously, I'm doing something wrong. Here is my sample program:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer


cat = ['ok', 'ko', 'maybe', 'maybe']


label_encoder = LabelEncoder()
label_encoder.fit(cat)


cat = label_encoder.transform(cat)

# prints [2 0 1 1], which seems good.
print(cat)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

res = ct.fit_transform([cat])

print(res)

Final result: [[1.0 0 1 1]]

Expected result, something like:

[
 [ 1 0 0 ]
 [ 0 0 1 ]
 [ 0 1 0 ]
 [ 0 1 0 ]
]

Can someone point out what I'm missing?


Solution

  • Consider using NumPy and MultiLabelBinarizer.

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    
    cat = np.array([['ok', 'ko', 'maybe', 'maybe']])
    
    m = MultiLabelBinarizer()
    print(m.fit_transform(cat.T))
    

    If you would rather stick with your approach, you only need to change the following:

    # [cat] is a single row with four columns, not a column with four rows
    # res = ct.fit_transform([cat])  => remove this
    
    # transpose so that each label becomes its own row; this should work
    res = ct.fit_transform(np.array([cat]).T)
    
    Out[2]:
    array([[0., 0., 1.],
           [1., 0., 0.],
           [0., 1., 0.],
           [0., 1., 0.]])
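
As a further option, not part of the original answer: recent scikit-learn versions let OneHotEncoder accept string categories directly, so the LabelEncoder step and the ColumnTransformer can be skipped when there is only one column to encode. The sketch below assumes that is your situation. The key point is the same as above: the input has to be a 2-D column of shape (n_samples, 1).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Reshape the labels into a single column: shape (4, 1), one sample per row.
cat = np.array(['ok', 'ko', 'maybe', 'maybe']).reshape(-1, 1)

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
res = encoder.fit_transform(cat).toarray()

# Categories are sorted alphabetically, which fixes the column order.
print(encoder.categories_)
print(res)
```

The column order follows `encoder.categories_`, so with these labels the columns correspond to 'ko', 'maybe', 'ok', which matches the output shown above.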