
How to create one hot encoding of a sequence of sequences


I want to one-hot encode a data set that looks like [[5,7,11,9,13,1,...],[3,7,5,9,16,....],..], where each sequence has length 24, the maximum possible integer in a sequence is 33, and there are 200 sequences in total. Each sequence is an integer representation of a sentence. How can I one-hot encode this efficiently? I have tried:

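import numpy as np

for sentence in sentences:
    n = maxlen            # length of each sequence (24)
    k = max_vocabullary   # number of columns; must exceed the largest id
    m = np.zeros((n, k))
    # set one column per row: row i gets a 1 at column sentence[i]
    m[np.arange(n), sentence] = 1
    print(m)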

Solution

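  • Try scikit-learn's OneHotEncoder. Note that it treats each column (here, each position in the sequence) as a separate categorical feature and returns a SciPy sparse matrix by default.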

    from sklearn.preprocessing import OneHotEncoder
    enc = OneHotEncoder()
    encoded_seqs = enc.fit_transform([[5,7,11,9,13,1,...],[3,7,5,9,16,....],..])
    

    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
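
If a dense array of shape (sequences, positions, vocabulary) is what you need, the per-sentence loop from the question can also be replaced by a single NumPy indexing step. The following is only a minimal sketch under stated assumptions: the ids are taken to run from 0 to 33 (so 34 columns), and `sentences`, `num_sequences` and `vocab_size` are hypothetical stand-ins filled with random data, not the asker's actual data set.

```python
import numpy as np

# Assumed dimensions from the question: 200 sequences of length 24.
# vocab_size = 34 assumes ids run from 0 to 33 inclusive.
num_sequences, maxlen, vocab_size = 200, 24, 34

# Hypothetical stand-in data; replace with the real integer sequences.
rng = np.random.default_rng(0)
sentences = rng.integers(0, vocab_size, size=(num_sequences, maxlen))

# Row i of the identity matrix is the one-hot vector for id i, so indexing
# the identity matrix with the whole id array encodes every sequence at once.
one_hot = np.eye(vocab_size, dtype=np.float32)[sentences]

print(one_hot.shape)  # (200, 24, 34)
```

Each position ends up with exactly one nonzero entry, and `one_hot.argmax(axis=-1)` recovers the original ids. If the ids start at 1 instead of 0, subtracting 1 before indexing (or keeping an unused column 0) would be one way to handle it.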