In Keras, I have the following code:
docs
Out[9]:
['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

labels = array([1,1,1,1,1,0,0,0,0,0])
voc_size = 50
encoded = [one_hot(d, voc_size) for d in docs]
max_length = 4
padded_docs = pad_sequences(encoded, maxlen=max_length, padding='post')
My understanding is that the one_hot encoding already creates an equal-length encoding of each doc, based on the vocabulary size. So why does each doc need to be padded again?
EDIT: another example for more clarification:
A one-hot encoding is a representation of categorical variables (e.g. cat, dog, rat) as binary vectors (e.g. [1,0,0], [0,1,0], [0,0,1]).
So in this case, cat, dog and rat are encoded as vectors of equal length. How is this different from the example above?
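For reference, such a classic one-hot encoding could be built by hand like this (a minimal sketch; the categories list and the helper function are just illustrative):

categories = ['cat', 'dog', 'rat']

def to_one_hot(value, categories):
    # Binary vector with a single 1 at the position of the category.
    return [1 if c == value else 0 for c in categories]

print(to_one_hot('cat', categories))  # [1, 0, 0]
print(to_one_hot('dog', categories))  # [0, 1, 0]
print(to_one_hot('rat', categories))  # [0, 0, 1]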
TL;DR: one_hot makes each index come from a fixed range; it does not make the result list have a fixed length.
In order to understand the issue, one needs to understand what the one_hot function actually does. It transforms a document into a sequence of int indices, which has approximately the same length as the number of words (tokens) in the document. E.g.:
'one hot encoding' -> [0, 2, 17]
where each index is the index of a word in the vocabulary (e.g. one has index 0). This means that when you apply one_hot to a sequence of texts (as in the piece of code you provided), you get a list of lists of indices, where each sublist might have a different length. This is a problem for keras and numpy, which expect the list of lists to be in array-like form, meaning that each sublist should have an equal, fixed length.
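For example, running one_hot on a few of the docs above could look roughly like this (a minimal sketch; the exact indices depend on the hashing, so the numbers shown are only illustrative):

from keras.preprocessing.text import one_hot

sample_docs = ['Well done!', 'Good work', 'Could have done better.']
encoded = [one_hot(d, 50) for d in sample_docs]
print(encoded)
# e.g. [[6, 16], [40, 3], [31, 27, 16, 18]] - sublists of length 2, 2 and 4,
# which cannot be stacked into a rectangular numpy array as they are.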
This is done via the pad_sequences function, which makes each of the sublists have a fixed length.
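A minimal sketch of that padding step (again with illustrative index values; the resulting shape is what matters):

from keras.preprocessing.sequence import pad_sequences

encoded = [[6, 16], [40, 3], [31, 27, 16, 18]]  # sublists of different lengths
padded = pad_sequences(encoded, maxlen=4, padding='post')
print(padded.shape)  # (3, 4) - now a rectangular numpy array
print(padded)
# [[ 6 16  0  0]
#  [40  3  0  0]
#  [31 27 16 18]]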