Tags: scikit-learn, feature-extraction, categorical-data

Issue with OneHotEncoder for categorical features


I want to encode 3 categorical features out of the 10 features in my dataset. I use OneHotEncoder from sklearn.preprocessing as follows:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

However, I can't proceed because I get this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG

I'm surprised that it complains about the string, since converting strings is exactly what the encoder is supposed to do. Am I missing something here?


Solution

  • If you read the docs for OneHotEncoder, you'll see that the input for fit is an "Input array of type int". So you need two steps to one-hot encode your data: first map the strings to integers with LabelEncoder, then one-hot encode those integers.

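    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']

    # Step 1: map each string to an integer label
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]

    # Step 2: one-hot encode the integer labels
    new_cat_features = new_cat_features.reshape(-1, 1)  # OneHotEncoder expects a 2-D array
    # sparse=False returns a dense array, which is easier to read
    # (the parameter was renamed to sparse_output in scikit-learn 1.2)
    ohe = preprocessing.OneHotEncoder(sparse=False)
    print(ohe.fit_transform(new_cat_features))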
    

    Output:

    [[ 0.  1.  0.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]
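
The snippet above encodes the feature names themselves for illustration. A minimal sketch of the same two steps applied to the values of one column might look like the following. The `colors` array is made-up sample data, not the asker's dataset, and `.toarray()` is used instead of the `sparse` argument so the sketch does not depend on how that parameter is named in a given scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values of a single categorical column
colors = np.array(['red', 'green', 'red', 'blue'])

# Step 1: strings -> integer codes (classes are sorted: blue=0, green=1, red=2)
le = LabelEncoder()
codes = le.fit_transform(colors)
print(codes)  # [2 1 2 0]

# Step 2: integer codes -> one-hot rows; the encoder returns a sparse
# matrix by default, so convert it to a dense array for printing
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(one_hot.shape)  # (4, 3)
```

Each categorical column would need its own LabelEncoder in this approach, which is part of why the newer API described in the edit is more convenient.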
    

    EDIT

    As of 0.20 this became a bit easier: OneHotEncoder now handles strings directly, and ColumnTransformer makes it easy to transform several columns at once. See the example below.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    # Mixing strings and numbers makes NumPy store every value as a string
    X = np.array([['apple', 'red', 1, 'round', 0],
                  ['orange', 'orange', 2, 'round', 0.1],
                  ['banana', 'yellow', 2, 'long', 0],
                  ['apple', 'green', 1, 'round', 0.2]])
    ct = ColumnTransformer(
        # the column numbers I want to apply this to
        # (sparse was renamed to sparse_output in scikit-learn 1.2)
        [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3])],
        remainder='passthrough'  # This leaves the rest of my columns in place
    )
    print(ct.fit_transform(X))  # Notice the output is an array of strings, because X is one
    

    Output:

    [['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
     ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
     ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
     ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
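
For the situation in the question, where the columns are referred to by name, a sketch along the following lines may be closer to what is needed. It assumes the data lives in a pandas DataFrame. The sample rows and the `duration` column are invented for illustration; only the three names in `cat_features` come from the question. `sparse_threshold=0` asks ColumnTransformer to always return a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the asker's dataset
dataset = pd.DataFrame({
    'color': ['Color', 'Black and White', 'Color'],
    'director_name': ['A', 'B', 'A'],
    'actor_2_name': ['X', 'Y', 'Z'],
    'duration': [120, 95, 110],  # numeric column, passed through unchanged
})
cat_features = ['color', 'director_name', 'actor_2_name']

ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(), cat_features)],  # select columns by name
    remainder='passthrough',  # keep the remaining columns
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(dataset)

# 2 + 2 + 3 one-hot columns, followed by the passthrough column
print(encoded.shape)  # (3, 8)
```

Because the DataFrame keeps the numeric column as numbers rather than strings, the passthrough values stay numeric here, unlike in the NumPy string array above.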