
Why is pad_sequences necessary when one_hot encoding is used?


In Keras, I can have the following code:

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
voc_size = 50
encoded = [one_hot(d, voc_size) for d in docs]
max_length = 4
padded_docs = pad_sequences(encoded, maxlen=max_length, padding='post')

My understanding is that the one_hot encoding already creates an equal-length representation of each doc based on the vocabulary size. So why does each doc need to be padded again?

EDIT: another example for more clarification:

A one-hot encoding is a representation of categorical variables (e.g. cat, dog, rat) as binary vectors (e.g. [1,0,0], [0,1,0], [0,0,1]).

So in this case, cat, dog and rat are encoded as equal length of vector. How is this different from the example above?
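
For reference, a classic one-hot encoding of these categories might look like the following (a minimal sketch; the category order and mapping are just for illustration):

categories = ['cat', 'dog', 'rat']
# Each category maps to a binary vector with a single 1 at its own position.
one_hot_vectors = {c: [1 if i == j else 0 for j in range(len(categories))]
                   for i, c in enumerate(categories)}
print(one_hot_vectors)
# {'cat': [1, 0, 0], 'dog': [0, 1, 0], 'rat': [0, 0, 1]}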


Solution

  • TL;DR: one_hot makes each index come from a fixed range; it does not make the resulting list have a fixed length.

    In order to understand the issue, one needs to understand what the one_hot function actually does. It transforms a document into a sequence of integer indices whose length is (approximately) the number of words (tokens) in the document. E.g.:

    'one hot encoding' -> [0, 2, 17]
    

    where each index is the index of a word in the vocabulary (e.g. one has index 0). This means that when you apply one_hot to a sequence of texts (as in the piece of code you provided), you get a list of lists of indices, where each list may have a different length. This is a problem for Keras and NumPy, which expect the list of lists to be array-like, i.e. each sublist should have the same, fixed length.
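
    For instance, applying one_hot to two of the docs above could produce something like this (a minimal sketch; the exact indices are hash-based, so the values shown are just assumed for illustration):

    from keras.preprocessing.text import one_hot

    # One index per word: a 2-word doc and a 4-word doc give lists of different lengths.
    encoded = [one_hot(d, 50) for d in ['Well done!', 'Could have done better.']]
    print(encoded)  # e.g. [[26, 18], [31, 7, 18, 41]] - lengths 2 and 4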

    This is done via the pad_sequences function, which pads (or truncates) each of the sublists to a fixed length.
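
    Continuing the sketch above (the index values are still just assumed for illustration), pad_sequences turns the ragged lists into a rectangular array:

    from keras.preprocessing.sequence import pad_sequences

    # padding='post' appends zeros after each sequence until it reaches maxlen.
    padded = pad_sequences([[26, 18], [31, 7, 18, 41]], maxlen=4, padding='post')
    print(padded)
    # [[26 18  0  0]
    #  [31  7 18 41]]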