Reading the documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
it states "encode categorical integer features using a one-hot aka one-of-K scheme."
Does this also mean it one-hot encodes a list of words?
The Wikipedia definition of one-hot encoding ( https://en.wikipedia.org/wiki/One-hot ) says:
"In natural language processing, a one-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary. The vector consists of 0s in all cells with the exception of a single 1 in a cell used uniquely to identify the word."
Running the code below, it appears that LabelEncoder is not a correct implementation of one-hot encoding, whereas OneHotEncoder is a correct implementation:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['w1 w2 w3', 'w1 w2']
values = array(data)
# integer encode: each distinct string becomes one integer
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary encode
# (sparse_output is the parameter name in scikit-learn >= 1.2; older releases call it sparse)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
# MultiLabelBinarizer iterates over each string, so every character counts as a label
mlb = MultiLabelBinarizer()
print('fit_transform\n' , mlb.fit_transform(data))
print('\none hot\n' , onehot_encoder.fit_transform(integer_encoded))
Prints:
fit_transform
[[1 1 1 1 1]
[1 1 1 0 1]]
one hot
[[0. 1.]
[1. 0.]]
So LabelEncoder does not one-hot encode. What type of encoding does LabelEncoder use?
From the above outputs it appears that OneHotEncoder produces a denser vector than the encoding scheme of LabelEncoder.
Update:
How do I decide whether to encode data for machine learning algorithms using LabelEncoder or OneHotEncoder?
I think your question is not clear enough...
First, LabelEncoder encodes labels with values between 0 and n_classes-1, while OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. They are different.
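To make the difference concrete, here is a minimal sketch that runs both encoders on the same input. The sample words and variable names are made up for illustration, and it assumes a scikit-learn version whose OneHotEncoder accepts string categories directly (0.20 or later).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample data: four items drawn from three distinct words
words = np.array(["a", "b", "c", "a"])

# LabelEncoder: one integer per item, in the range 0 .. n_classes-1
label_codes = LabelEncoder().fit_transform(words)

# OneHotEncoder expects a 2-D array of shape (n_samples, n_features);
# .toarray() turns the default sparse result into a dense array
onehot = OneHotEncoder().fit_transform(words.reshape(-1, 1)).toarray()

print(label_codes)  # one integer per item
print(onehot)       # one row per item, one column per distinct word
```

LabelEncoder returns a 1-D array of integers, while OneHotEncoder returns a matrix with exactly one 1 in each row.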
Second, yes, OneHotEncoder can encode a list of words. The Wikipedia definition says a one-hot vector is a 1 × N matrix. But what is N? N is the size of your vocabulary.
For example, suppose you have five words: a, b, c, d, e. One-hot encoding them gives:
a -> [1, 0, 0, 0, 0] # a one-hot 1 x 5 vector
b -> [0, 1, 0, 0, 0] # a one-hot 1 x 5 vector
c -> [0, 0, 1, 0, 0] # a one-hot 1 x 5 vector
d -> [0, 0, 0, 1, 0] # a one-hot 1 x 5 vector
e -> [0, 0, 0, 0, 1] # a one-hot 1 x 5 vector
# total five one-hot 1 x 5 vectors which can be expressed in a 5 x 5 matrix.
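The same mapping can be sketched with plain NumPy: the rows of a 5 x 5 identity matrix are exactly the five one-hot vectors. The names `vocab`, `index` and `one_hot` are hypothetical helpers for this illustration, not part of scikit-learn.

```python
import numpy as np

vocab = ["a", "b", "c", "d", "e"]
# Map each word to its position in the vocabulary
index = {word: i for i, word in enumerate(vocab)}
# Row i of the identity matrix is the one-hot vector for word i
identity = np.eye(len(vocab), dtype=int)

def one_hot(word):
    """Return the 1 x N one-hot vector for a word in the vocabulary."""
    return identity[index[word]]

print(one_hot("c"))  # [0 0 1 0 0]
```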
Third, actually I'm not 100% sure what you are asking...
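Finally, to answer your updated question: most of the time you should choose one-hot encoding or word embeddings for input features. The reason is that LabelEncoder maps each word to a single integer, so the encoded values differ only in magnitude and imply an ordering and a distance between words that does not really exist. Similar inputs tend to produce similar outputs, so a model may treat words with nearby codes as related, which makes it harder to fit. LabelEncoder is intended for encoding target labels (y), not input features.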
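One way to compare the two encodings is to look at the distances between encoded words. The sketch below uses plain NumPy with five hypothetical words coded 0 to 4; it is an illustration of the idea, not a statement about how any particular model behaves.

```python
import numpy as np

n_words = 5
int_codes = np.arange(n_words)   # integer codes 0..4, as LabelEncoder would assign
onehot = np.eye(n_words)         # one-hot vectors for the same five words

# With integer codes, the distance depends on the arbitrary order of the codes
int_dist_first_second = abs(int_codes[0] - int_codes[1])
int_dist_first_last = abs(int_codes[0] - int_codes[4])

# With one-hot vectors, every pair of distinct words is equally far apart
oh_dist_first_second = np.linalg.norm(onehot[0] - onehot[1])
oh_dist_first_last = np.linalg.norm(onehot[0] - onehot[4])

print(int_dist_first_second, int_dist_first_last)
print(oh_dist_first_second, oh_dist_first_last)
```

The integer codes place the first word much closer to the second than to the last, while the one-hot vectors treat all words as equally different.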