Tags: pandas, numpy, scikit-learn, numpy-ndarray, one-hot-encoding

Display feature names in columns after using One Hot encoding


I have one column in a CSV file that contains fruit names, and I want to convert it into an array.

Sample csv column:

Names:
Apple
Banana
Pear
Watermelon
Jackfruit
..
..
..

There are around 400 fruit names in the column.

I have applied one-hot encoding to this column, but I am unable to display the column names (each fruit name from a row of the CSV column).

My code so far:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('D:/fruits.csv')
X= dataset.iloc[:, 0].values


labelencoder_X = LabelEncoder()
D= labelencoder_X.fit_transform(X)
D = D.reshape(-1, 1)

onehotencoder = OneHotEncoder(sparse=False, categorical_features = [0])
X = onehotencoder.fit_transform(D)

This converts the column into a NumPy array, but the column names come out as [0 1 2 3 .. ..]. I want them to be the fruit names from the CSV rows, for example [Apple Banana Pear Watermelon .. ..].

How can I retain the column names after one-hot encoding?


Solution

  • Original Answer:

    An efficient way to one-hot encode is pd.get_dummies. Here it is applied to sample data:

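    import pandas as pd

    data = {'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']}
    df = pd.DataFrame(data=data)

    # One-hot encode every string column; new columns are named <column>_<value>
    df_new = pd.get_dummies(df)
    print(df_new)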
    

    Original df:

            Names
    0       Apple
    1      Banana
    2        Pear
    3  Watermelon
    

    Encoded df:

       Names_Apple  Names_Banana  Names_Pear  Names_Watermelon
    0            1             0           0                 0
    1            0             1           0                 0
    2            0             0           1                 0
    3            0             0           0                 1
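
The question asks for a NumPy array with the plain fruit names as column labels. A minimal sketch of one way to get both from pd.get_dummies, assuming the column is called Names as in the sample data: passing the single column (a Series) instead of the whole dataframe leaves out the "Names_" prefix, and dtype=int forces 0/1 values, since newer pandas versions return booleans by default. The variable names encoded, column_names and X are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']})

# Encoding a Series (one column) yields columns named after the values themselves.
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(df['Names'], dtype=int)

# The fruit names live in the column index of the encoded dataframe.
column_names = list(encoded.columns)

# The underlying values as a plain NumPy array, one row per CSV row.
X = encoded.to_numpy()

print(column_names)
print(X)
```

The array itself carries no labels, so column_names has to be kept alongside X; position i in column_names describes column i of X.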
    

    Edit:

    Let's assume that our dataframe contains 2 categorical and 2 numeric features, and we want to one-hot encode only 1 of the 2 categorical columns.

    Generating dummy data:

    data = {'Names':['Apple','Banana','Pear', 'Watermelom'],
            'Category' :['A','B','A','B'],
            'Val1':[10,20,30,30],
            'Val2':[60,70,80,90]}
    df = pd.DataFrame(data=data)
    
            Names Category  Val1  Val2
    0       Apple        A    10    60
    1      Banana        B    20    70
    2        Pear        A    30    80
    3  Watermelom        B    30    90
    

    If we just want to one-hot encode Names, we can do that with:

    df_new = pd.get_dummies(df, columns=['Names'])
    print(df_new)
    

    You can refer to the pd.get_dummies documentation. By passing the columns argument, only the listed columns are encoded; all other columns are left unchanged.

    Encoded Output:

      Category  Val1  Val2  Names_Apple  Names_Banana  Names_Pear  Names_Watermelom
    0        A    10    60            1             0           0                 0
    1        B    20    70            0             1           0                 0
    2        A    30    80            0             0           1                 0
    3        B    30    90            0             0           0                 1
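
If you would rather stay with scikit-learn, as in the question, the fitted encoder keeps the category labels in its categories_ attribute, and these can be used as column names. The following is a rough sketch rather than a drop-in fix. It assumes scikit-learn 0.20 or newer, where OneHotEncoder accepts string columns directly, so the LabelEncoder step is not needed. It calls .toarray() on the default sparse result rather than passing a sparse-related argument, because that argument's name differs between versions. The names fruit_names and df_encoded are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Names': ['Apple', 'Banana', 'Pear', 'Watermelon']})

encoder = OneHotEncoder()

# The encoder expects 2-D input, hence the double brackets around the column name.
# The default result is a sparse matrix; toarray() turns it into a dense NumPy array.
encoded = encoder.fit_transform(df[['Names']]).toarray()

# categories_ holds one array of labels per input column, in output-column order.
fruit_names = encoder.categories_[0]

# Re-attach the labels by wrapping the array in a dataframe.
df_encoded = pd.DataFrame(encoded, columns=fruit_names, index=df.index)
print(df_encoded)
```

Compared with pd.get_dummies, the fitted encoder can be reused on new data through encoder.transform, which keeps the column order consistent between training and prediction.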