
Data Preparation for training


I am trying to prepare a data file for training a classification model by one-hot encoding its characters. My training data file consists of lines of characters; I first integer-encode them and then one-hot encode the result.

e.g. this is how the data file looks:

  1. afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj
  2. fgtfafadargggagagagagagavcacacacarewrtgwgjfjqiufqfjfqnmfhbqvcqvfqfqafaf
  3. fqiuhqqhfqfqfihhhhqeqrqtqpocckfmafaflkkljlfabadakdpodqpqrqjdmcoqeijfqfjqfjoqfjoqgtggsgsgqr

This is how I am approaching it:

import pandas as pd
from sklearn import preprocessing

categorical_data = pd.read_csv('abc.txt', sep="\n", header=None)
labelEncoder = preprocessing.LabelEncoder()
X = categorical_data.apply(labelEncoder.fit_transform)
print("After label encoder")
print(X.head())

oneHotEncoder = preprocessing.OneHotEncoder()
oneHotEncoder.fit(X)

onehotlabels = oneHotEncoder.transform(X).toarray()
print("Shape after one hot encoding:", onehotlabels.shape)

print(onehotlabels)

I am getting an integer encoding for each line (0, 1, 2 in my case) and then the corresponding one-hot encoded vector.

My question is: how do I do this for each character within an individual line? For prediction, the model should learn from the characters in a line (each line corresponds to a certain label). Can someone give me some insight on how to proceed from here?


Solution

  • Given your example, I end up with a DataFrame like so:

        0
    0   0
    1   1
    2   2
    

    From your description, it sounds like you want each line to have its own independent one-hot encoding. So let's first look at line 1.

    afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj
    

    The reason you are getting the DataFrame shown above is that this line is read into the DataFrame and then passed to the labelEncoder and oneHotEncoder as a single value, not as an array of 63 values (one per character in the string).
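To make the difference concrete, here is a minimal sketch comparing what LabelEncoder produces for whole lines versus the characters of one line. The three lines are pasted in from the question rather than read from abc.txt, and the variable names are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# The three example lines from the question, held in a list instead of a file.
lines = [
    "afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj",
    "fgtfafadargggagagagagagavcacacacarewrtgwgjfjqiufqfjfqnmfhbqvcqvfqfqafaf",
    "fqiuhqqhfqfqfihhhhqeqrqtqpocckfmafaflkkljlfabadakdpodqpqrqjdmcoqeijfqfjqfjoqfjoqgtggsgsgqr",
]

# Whole lines: each line is a single category, so three lines give three codes.
line_codes = LabelEncoder().fit_transform(lines)
print(line_codes)  # [0 1 2]

# Characters of line 1: each distinct character becomes a category.
char_codes = LabelEncoder().fit_transform(list(lines[0]))
print(len(char_codes), sorted(set(char_codes.tolist())))  # 63 [0, 1, 2, 3, 4]
```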

    What you really want is to pass the labelEncoder an array of 63 single characters:

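    import numpy as np

    # Split line 1 into an array of single characters (63 elements)
    data = np.array(list(categorical_data[0][0]))
    # Integer-encode the characters, then one-hot encode the integer codes
    X = labelEncoder.fit_transform(data)
    oneHotEncoder.fit(X.reshape(-1,1))
    row_1_labels = oneHotEncoder.transform(X.reshape(-1,1)).toarray()
    row_1_labels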
    
    array([[ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  0.,  0.,  1.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  0.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.]])
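As a possible shortcut, scikit-learn 0.20 and later let OneHotEncoder take string categories directly, so the LabelEncoder step can be skipped. The following is a minimal sketch for line 1 under that assumption, with the line pasted in rather than read from abc.txt; `chars`, `encoder`, and `row_1_onehot` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

line_1 = "afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj"

# One character per row: OneHotEncoder expects a 2-D array (samples x features).
chars = np.array(list(line_1)).reshape(-1, 1)

encoder = OneHotEncoder()
row_1_onehot = encoder.fit_transform(chars).toarray()

print(encoder.categories_[0])  # distinct characters, in column order
print(row_1_onehot.shape)      # (63, 5)
```

The `categories_` attribute is useful for checking which letter each column stands for.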
    

    You could repeat this for each row to get independent one-hot encodings, like so:

    one_hot_encodings = categorical_data.apply(lambda x: [oneHotEncoder.fit_transform(labelEncoder.fit_transform(np.array([let for let in x[0]])).reshape(-1,1)).toarray()], axis=1)
    one_hot_encodings
    
                                                        0
    0   [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0....
    1   [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
    2   [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
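One thing to keep in mind with independent encodings: each row gets its own column layout, so the same column index can stand for different letters in different rows (row 0 in the output has 5 columns, the other rows have more). A small NumPy-only sketch with two made-up short lines illustrates this; `one_hot_independent` is a hypothetical helper, not part of any library.

```python
import numpy as np

def one_hot_independent(line):
    """One-hot encode a string using only the characters found in that string."""
    # letters: sorted distinct characters; codes: index of each character in letters
    letters, codes = np.unique(list(line), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for code i.
    return letters, np.eye(len(letters))[codes]

letters_a, onehot_a = one_hot_independent("afal")
letters_b, onehot_b = one_hot_independent("fgtf")

print(letters_a, onehot_a.shape)  # column 0 stands for 'a' in this row
print(letters_b, onehot_b.shape)  # column 0 stands for 'f' in this row
```

Both encodings start with the vector `[1, 0, 0]`, yet it means 'a' in one and 'f' in the other, which is why a shared encoding is usually easier to feed to a single model.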
    

    If you want the rows to be one-hot encoded based on the values found in all rows, first fit the labelEncoder on all of the unique letters and then transform each row, like so:

    unique_letters = np.unique(np.array([let for row in categorical_data.values for let in row[0]]))
    labelEncoder.fit(unique_letters)
    unique_nums = labelEncoder.transform(unique_letters)
    oneHotEncoder.fit(unique_nums.reshape(-1,1))
    cat_dat = categorical_data.apply(lambda x: [np.array([let for let in x[0]])], axis=1)
    one_hot_encoded = cat_dat.apply(lambda x: [oneHotEncoder.transform(labelEncoder.transform(x[0]).reshape(-1,1)).toarray()], axis=1)
    one_hot_encoded
    
                                                        0
    0   [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
    1   [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
    2   [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
    

    This returns a DataFrame in which each row contains the one-hot encoded array for that line, with columns based on the letters found across all rows.
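The same shared-alphabet idea can be sketched more compactly, assuming a scikit-learn version (0.20 or later) whose OneHotEncoder accepts strings directly. The lines are pasted in from the question, and `alphabet`, `encoder`, and `encoded` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

lines = [
    "afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj",
    "fgtfafadargggagagagagagavcacacacarewrtgwgjfjqiufqfjfqnmfhbqvcqvfqfqafaf",
    "fqiuhqqhfqfqfihhhhqeqrqtqpocckfmafaflkkljlfabadakdpodqpqrqjdmcoqeijfqfjqfjoqfjoqgtggsgsgqr",
]

# Every distinct character across all lines, sorted for a stable column order.
alphabet = sorted(set("".join(lines)))

# Fit once on the shared alphabet so that all lines use the same columns.
encoder = OneHotEncoder()
encoder.fit(np.array(alphabet).reshape(-1, 1))

# Encode each line separately: one (len(line), len(alphabet)) array per line.
encoded = [
    encoder.transform(np.array(list(line)).reshape(-1, 1)).toarray()
    for line in lines
]

for line, matrix in zip(lines, encoded):
    print(len(line), matrix.shape)
```

The arrays still have different numbers of rows because the lines differ in length, so a model that needs fixed-size input would require a further step that is not shown here.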