python-3.x pandas scikit-learn data-science one-hot-encoding

One Hot Encoding a single column


I am trying to use OneHotEncoder on the target column ('Species') in the Iris dataset.

But I am getting the following error:

ValueError: Expected 2D array, got 1D array instead:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

I googled the issue and found that most scikit-learn estimators need a 2D array rather than a 1D array.

I also found a suggestion to pass the column's index to the encoder in order to encode a single column, but it didn't work:

onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
X = dataset.iloc[:,1:5].values
y = dataset.iloc[:, 5].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder= LabelEncoder()
y = labelencoder.fit_transform(y)


onehotencoder = OneHotEncoder(categorical_features=[0])
y = onehotencoder.fit_transform(y)

I am trying to encode a single categorical column and split it into multiple columns (the way one-hot encoding usually works).


Solution

  • ValueError: Expected 2D array, got 1D array instead: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

    This says that the encoder expects a 2D array, so you need to reshape your 1D array into a single column with reshape(-1, 1). You can do that like this:

    >>> from sklearn import datasets
    >>> from sklearn.preprocessing import OneHotEncoder
    >>> import pandas as pd
    >>> import numpy as np
    
    >>> # load the iris dataset into a DataFrame
    >>> iris = datasets.load_iris()
    >>> iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
    >>> y = iris.target.values
    >>> onehotencoder = OneHotEncoder(categories='auto')
    >>> # reshape(-1, 1) turns the 1D array into a 2D array with one column
    >>> y = onehotencoder.fit_transform(y.reshape(-1, 1))
    >>> # y is now a sparse matrix with float64 values;
    >>> # call toarray() if you want a dense NumPy array
    >>> print(y.toarray())
    [[1. 0. 0.]
     [1. 0. 0.]
        . . . . 
     [0. 0. 1.]
     [0. 0. 1.]]
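
The question works with the string column 'Species' rather than a numeric target, so here is a minimal sketch of the same idea applied to strings. It assumes a scikit-learn release recent enough that OneHotEncoder accepts string categories directly (so no LabelEncoder step is needed) and no longer takes the categorical_features argument used in the question. The small DataFrame is a hypothetical stand-in for the real data; only the column name and label strings come from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's DataFrame; only the column name
# 'Species' and the label strings are taken from the question.
df = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
})

# df["Species"] is 1D (a Series); df[["Species"]] is a 2D, single-column
# DataFrame, which is the shape the encoder expects.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Species"]])  # sparse matrix by default

dense = encoded.toarray()                  # dense NumPy array, one column per category
categories = list(encoder.categories_[0])  # category order matches the output columns

print(categories)
print(dense)
```

Selecting with double brackets is just another way to get a 2D input; df["Species"].values.reshape(-1, 1) should work as well.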
    

    You can also use the pandas get_dummies function:

    >>> pd.get_dummies(iris.target).head()
       0.0  1.0  2.0
    0    1    0    0
    1    1    0    0
    2    1    0    0
    3    1    0    0
    4    1    0    0
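
If the goal is to split the 'Species' column into several columns inside the same DataFrame, a sketch along these lines may be closer to what the question asks for. The sample rows are hypothetical, and dtype=int is an assumption made here to get 0/1 values, since newer pandas versions return booleans from get_dummies by default.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the question's table.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# One indicator column per category, named "Species_<label>".
# dtype=int forces 0/1 output instead of booleans.
dummies = pd.get_dummies(df["Species"], prefix="Species", dtype=int)

# Replace the original column with its one-hot columns.
result = pd.concat([df.drop(columns="Species"), dummies], axis=1)
print(result)
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so the encoder is usually the safer choice when the same encoding has to be applied to new data later.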
    

    Hope that helps!