python machine-learning scikit-learn prediction one-hot-encoding

Explain OneHotEncoder using Python

I am new to the scikit-learn library and have been trying to use it to predict stock prices. I was going through its documentation and got stuck at the part where they explain OneHotEncoder(). Here is the code they use:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Can someone please explain step by step what is happening here? I have a clear idea of how one-hot encoding works, but I can't figure out how this code works. Any help is appreciated. Thanks!

Solution

Let's start by writing down what you would expect (assuming you know what one-hot encoding means).

unencoded

f0 f1 f2
0, 0, 3
1, 1, 0
0, 2, 1
1, 0, 2

encoded

|f0|  |  f1 |  |   f2   |

1, 0, 1, 0, 0, 0, 0, 0, 1 
0, 1, 0, 1, 0, 1, 0, 0, 0
1, 0, 0, 0, 1, 0, 1, 0, 0
0, 1, 1, 0, 0, 0, 0, 1, 0

The mapping that produces encoded is learned by:

enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

using the default n_values='auto'. Note that fit only learns the mapping; calling enc.transform on the same data would produce encoded. With n_values='auto' you are specifying that the values each feature (each column of unencoded) can take on are inferred from the values in the columns of the data handed to fit.

That brings us to enc.n_values_

from the docs:

Number of values per feature.

enc.n_values_
array([2, 3, 4])

The above means that f0 (column 1) can take on 2 values (0, 1), f1 can take on 3 values (0, 1, 2), and f2 can take on 4 values (0, 1, 2, 3).

Indeed, these are the values of the features f0, f1, f2 in the unencoded feature matrix.
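The inference described above can be sketched in plain Python. This is only an illustration, not scikit-learn's actual code: infer_n_values is a hypothetical helper, and it assumes the categories are non-negative integers so that the number of values per column is the largest value seen plus one.

```python
# Training rows from the question; the columns are the features f0, f1, f2.
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]


def infer_n_values(rows):
    """Hypothetical helper: number of values per column, taken as max + 1.

    Assumes every category is a non-negative integer.
    """
    # zip(*rows) transposes the row list so we can iterate column by column.
    return [max(column) + 1 for column in zip(*rows)]


n_values = infer_n_values(X)
print(n_values)  # [2, 3, 4]
```

The result matches the array([2, 3, 4]) reported by enc.n_values_ for the same data.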

then,

enc.feature_indices_
array([0, 2, 5, 9])

from the docs:

Indices to feature ranges. Feature i in the original data is mapped to features from feature_indices_[i] to feature_indices_[i+1] (and then potentially masked by active_features_ afterwards)

This gives the range of positions (in the encoded space) that each of the features f0, f1, f2 occupies:

f0: [0, 1], f1: [2, 3, 4], f2: [5, 6, 7, 8]
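One way to see where these boundaries come from: they are the running total of the per-feature value counts, starting at 0. The sketch below assumes the counts [2, 3, 4] from the example; the variable names are illustrative and not part of scikit-learn.

```python
from itertools import accumulate

# Number of values per feature, as reported by enc.n_values_ in the example.
n_values = [2, 3, 4]

# Each feature's block starts where the previous block ended,
# so the boundaries are the cumulative sum with a leading 0.
feature_indices = [0] + list(accumulate(n_values))
print(feature_indices)  # [0, 2, 5, 9]

# Pair consecutive boundaries to list the encoded positions of each feature.
ranges = [
    list(range(start, stop))
    for start, stop in zip(feature_indices, feature_indices[1:])
]
print(ranges)  # [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
```

The last boundary, 9, is also the total width of an encoded row.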

Mapping the vector [0, 1, 1] into one-hot encoded space (under the mapping we got from enc.fit):

1, 0, 0, 1, 0, 0, 1, 0, 0

How?

The first element, 0, belongs to f0, which starts at position 0, so it maps to position 0 (if the element were 1 instead of 0, it would map to position 1).

The next element 1 maps into position 3 because f1 starts at position 2 and the element 1 is the second possible value f1 can take on.

Finally, the third element, 1, maps to position 6, since it is the second possible value f2 can take on and f2 starts at position 5.
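The three steps above follow one rule: the position of a value is the start of its feature's block plus the value itself. A minimal sketch of that rule, assuming integer categories and the boundaries [0, 2, 5, 9] from the example (one_hot is a hypothetical helper, not a scikit-learn function):

```python
# Block boundaries from the example: each feature's block starts at
# feature_indices[i]; the last entry is the total encoded width.
feature_indices = [0, 2, 5, 9]


def one_hot(row, feature_indices):
    """Hypothetical helper: one-hot encode a single row of integer categories."""
    encoded = [0.0] * feature_indices[-1]
    for feature, value in enumerate(row):
        # Position = start of this feature's block + the value itself.
        encoded[feature_indices[feature] + value] = 1.0
    return encoded


print(one_hot([0, 1, 1], feature_indices))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

This reproduces the row returned by enc.transform([[0, 1, 1]]).toarray(): positions 0, 3, and 6 are set. The sketch does no range checking, so a value outside a feature's block would land in a neighbouring block or raise an IndexError.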

Hope that clears up some stuff.