python machine-learning one-hot-encoding

How to perform one-hot encoding in Python


Hi, I've been trying to do this for the past few hours, but I just can't seem to get it to work. I have imported the necessary packages and loaded my CSV file into a variable X.

My CSV file is a single column in which each element is a number from 0 to 9. I would like to create another CSV file with 10 columns of 0s and 1s to use as a target set. I've tried using sklearn's LabelEncoder and OneHotEncoder, but I haven't had any luck.

Thanks for reading, and thanks in advance for any help.


Solution

  • If it's in a CSV file, you can use the pandas package as follows:

    import pandas as pd                # import the package
    df = pd.read_csv(path)             # path: location of the CSV file; df holds its contents as a DataFrame
    ydf = pd.get_dummies(df['label'])  # 'label' is the header of the column
                                       # in the CSV that you want to one-hot encode
    

    See the pandas documentation for get_dummies for details.
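
For the case in the question (digits 0 to 9 in, a 10-column CSV out), the pandas approach could be sketched roughly as follows. This is only an illustration under assumptions: the column header 'label', the helper name `one_hot_csv`, and the in-memory buffers standing in for real files are all hypothetical. Declaring the ten categories up front is one way to make sure every digit gets a column, even a digit that never occurs in the file.

```python
import io
import pandas as pd

def one_hot_csv(src, dst, column="label", n_classes=10):
    """Read one label column from src and write n_classes 0/1 columns to dst.

    src and dst may be file paths or file-like objects.
    The column name and helper name are assumptions, not from the question.
    """
    df = pd.read_csv(src)
    # Fix the category set so that missing digits still produce an all-zero column
    labels = pd.Categorical(df[column], categories=list(range(n_classes)))
    one_hot = pd.get_dummies(labels, dtype=int)  # dtype=int gives 0/1 instead of booleans
    one_hot.to_csv(dst, index=False)
    return one_hot

# Small in-memory stand-in for the real CSV file
source = io.StringIO("label\n3\n0\n9\n3\n")
target = io.StringIO()
one_hot = one_hot_csv(source, target)
print(one_hot.shape)
print(target.getvalue())
```

With real files, the two buffers would simply be replaced by the input and output paths.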

    If it's a NumPy array, you can try the following:

    import numpy as np
    vector = np.arange(5)    # vector = [0 1 2 3 4]
    
    # use the built-in int: the np.int alias was removed in NumPy 1.24
    one_hot = (vector == 0).astype(int)  # [1 0 0 0 0]
    one_hot = (vector == 2).astype(int)  # [0 0 1 0 0]
    one_hot = (vector == 4).astype(int)  # [0 0 0 0 1]
    

    You can do the same with your own NumPy array of labels:

    vector = np.arange(no_of_different_labels)
    
    # transform labels into one-hot representation
    # (use the built-in float: the np.float alias was removed in NumPy 1.24)
    y_train_one_hot = (vector == y_train).astype(float)
    # make sure y_train has shape (m, 1) and not (m,) for broadcasting to work
    

    got it from this link
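
To make the shape requirement concrete, here is a small sketch of the broadcasting comparison on made-up labels. The sample values and the names `n_classes`, `classes`, and `y_col` are illustrative assumptions. Comparing a `(10,)` vector with an `(m, 1)` column broadcasts to an `(m, 10)` boolean array, one row per label. Indexing an identity matrix with the labels is a common alternative that should give the same result.

```python
import numpy as np

n_classes = 10
y_train = np.array([3, 0, 9, 3])       # made-up labels, shape (m,)

classes = np.arange(n_classes)         # shape (10,)
y_col = y_train.reshape(-1, 1)         # shape (m, 1), needed for broadcasting

# (10,) compared with (m, 1) broadcasts to (m, 10)
y_one_hot = (classes == y_col).astype(float)

# Alternative: pick rows of the identity matrix by label
y_one_hot_alt = np.eye(n_classes)[y_train]

print(y_one_hot.shape)
print(np.array_equal(y_one_hot, y_one_hot_alt))
```

If `y_train` is left with shape `(m,)` and `m` differs from the number of classes, the comparison cannot broadcast to an `(m, 10)` array, which is why the reshape matters.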