Tags: python, scikit-learn, one-hot-encoding

OneHotEncoding transformation interpretation


I'm trying to understand the output of one-hot encoding in Python with scikit-learn. I believe I get the idea of one-hot encoding: convert discrete values into extended feature vectors, with a value of 'on' identifying membership of a category. Perhaps I have that wrong, which may be what is confusing me, but that's my understanding.

So, from the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

I see the following example:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Could someone please explain how the data [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]] ends up being transformed into [[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]?

How is the transformation argument [0, 1, 1] used?

Many thanks for any help with this

Jon


Solution

  • After further digging, here is my attempt at explaining one way to understand this, in case it helps others.

    1) The original data set (the one passed to fit) is [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

    2) For each position (column), reduce the values down to a list of unique ordered values:

    For position 1 (0, 1, 0, 1) --> [0, 1]
    For position 2 (0, 1, 2, 0) --> [0, 1, 2]
    For position 3 (3, 0, 1, 2) --> [0, 1, 2, 3]
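
As a rough sketch of this reduction step (plain Python, not scikit-learn's actual implementation; the names `data` and `categories` are just for illustration), the per-position lists can be built like this:

```python
# The data set passed to fit in the question.
data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# zip(*data) regroups the rows by position (column);
# sorted(set(...)) keeps each distinct value once, in ascending order.
categories = [sorted(set(column)) for column in zip(*data)]

print(categories)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
```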
    

    To transform a new row, compare each of its values, position by position, against the list of unique ordered values for that position.

    For the row passed to transform, [0, 1, 1]:

    The first '0' generates [1, 0] ('0' matches the value in position one, not position two)
    The next '1' generates [0, 1, 0] ('1' only matches the value in position two)
    The last '1' generates [0, 1, 0, 0] ('1' only matches the value in position two)
    

    Put together, these give [1, 0, 0, 1, 0, 0, 1, 0, 0], which is the output of transform.
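
The whole procedure can be mimicked by hand. This is only a minimal sketch of the logic described here, assuming the encoder behaves as in the steps above; `fit_categories` and `one_hot_row` are hypothetical helper names, not scikit-learn functions.

```python
def fit_categories(rows):
    """Return the sorted unique values seen at each position (column)."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot_row(row, categories):
    """Encode one row: each value becomes a 0/1 block as wide as its category list."""
    encoded = []
    for value, column_categories in zip(row, categories):
        if value not in column_categories:
            # Assumption: reject values never seen during fitting.
            raise ValueError(f"unknown value {value!r}")
        encoded.extend(1.0 if value == c else 0.0 for c in column_categories)
    return encoded


data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(data)
result = one_hot_row([0, 1, 1], categories)

print(result)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The fitted data only decides how wide each block is and which slot belongs to which value; the row given to the encoding step decides which slot in each block is switched on.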

    I've tried this with a number of other data sets, and the logic is consistent.
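
The n_values_ and feature_indices_ arrays printed in the question appear to fit the same picture, at least for this data set, where every position's values run from 0 upward without gaps: the first matches the size of each block and the second matches where each block starts in the output. A small sketch under that assumption (plain Python; `widths`, `offsets` and `hot_columns` are illustrative names):

```python
from itertools import accumulate

# Sorted unique values per position, as derived in step 2.
categories = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

# Width of each one-hot block, and the running total marking where each block starts.
widths = [len(c) for c in categories]
offsets = [0] + list(accumulate(widths))

print(widths)   # [2, 3, 4]
print(offsets)  # [0, 2, 5, 9]

# Output column that is switched on for each value of the row [0, 1, 1].
row = [0, 1, 1]
hot_columns = [offsets[i] + categories[i].index(v) for i, v in enumerate(row)]

print(hot_columns)  # [0, 3, 6]
```

Columns 0, 3 and 6 are exactly the positions holding 1 in the nine-element output.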