I've created a model using the Keras functional API, with GloVe pretrained embeddings:
def create_model(input_length=20, output_length=20):
    encoder_input = tf.keras.Input(shape=(input_length,))
    decoder_input = tf.keras.Input(shape=(output_length,))
    encoder = tf.keras.layers.Embedding(original_embedding_matrix.shape[0],
                                        original_embedding_dim,
                                        weights=[original_embedding_matrix],
                                        mask_zero=True)(encoder_input)
    encoder, h_encoder, u_encoder = tf.keras.layers.LSTM(64, return_state=True)(encoder)
    decoder = tf.keras.layers.Embedding(clone_embedding_matrix.shape[0],
                                        clone_embedding_dim,
                                        weights=[clone_embedding_matrix],
                                        mask_zero=True)(decoder_input)
    decoder = tf.keras.layers.LSTM(64, return_sequences=True)(decoder, initial_state=[h_encoder, u_encoder])
    decoder = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(clone_vocab_size + 1))(decoder)
    model = tf.keras.Model(inputs=[encoder_input, decoder_input], outputs=[decoder])
    model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError(), metrics=['accuracy'])
    return model

model = create_model()
Here are my encoder/decoder shapes:
training_encoder_input.shape --> (2500, 20)
training_decoder_input.shape --> (2500, 20)
training_decoder_output.shape --> (2500, 20, 11272)
clone_vocab_size --> 11271
Output of model.summary():
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 20)] 0
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, 20)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 20, 50) 564800 input_1[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 20, 50) 563600 input_2[0][0]
__________________________________________________________________________________________________
lstm (LSTM) [(None, 64), (None, 29440 embedding[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM) (None, 20, 64) 29440 embedding_1[0][0]
lstm[0][1]
lstm[0][2]
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, 20, 11272) 732680 lstm_1[0][0]
==================================================================================================
Total params: 1,919,960
Trainable params: 1,919,960
Non-trainable params: 0
__________________________________________________________________________________________________
But when I try to train the model:
model.fit(x=[training_encoder_input, training_decoder_input],
          y=training_decoder_output,
          verbose=2,
          batch_size=128,
          epochs=10)
I get this error:
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: indices[28,0] = 11292 is not in [0, 11272)
[[node functional_1/embedding_1/embedding_lookup (defined at <ipython-input-11-967d0351a90e>:31) ]]
(1) Invalid argument: indices[28,0] = 11292 is not in [0, 11272)
[[node functional_1/embedding_1/embedding_lookup (defined at <ipython-input-11-967d0351a90e>:31) ]]
[[broadcast_weights_1/assert_broadcastable/AssertGuard/else/_13/broadcast_weights_1/assert_broadcastable/AssertGuard/Assert/data_7/_78]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_13975]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embedding_1/embedding_lookup:
functional_1/embedding_1/embedding_lookup/8859 (defined at /usr/lib/python3.6/contextlib.py:81)
Input Source operations connected to node functional_1/embedding_1/embedding_lookup:
functional_1/embedding_1/embedding_lookup/8859 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function -> train_function
Someone already asked this question, but none of the responses worked for me. The error is probably in the loss function or in the vocabulary of the embedding layer, but I can't figure out exactly what the problem is.
The solution is pretty simple, in fact. In the error:

(0) Invalid argument: indices[28,0] = 11292 is not in [0, 11272)

11292 is an input element (mapped to a word in my Tokenizer dictionary) and 11272 is the length of my vocabulary. Why do I have a word with number 11292 if the length of my tokenizer is just 11272? Because the tokenizer's dictionary contains more words than the embedding matrix has rows: every token id must be strictly smaller than the embedding layer's input dimension, so the embedding matrix needs len(tokenizer.word_index) + 1 rows to cover all the indices the tokenizer can produce.
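A quick way to catch this mismatch before training is to compare the largest token id in your data against the number of rows in the embedding matrix. This is a minimal sketch with small stand-in arrays in place of the real `training_decoder_input` and `clone_embedding_matrix` from the question:

```python
import numpy as np

# Stand-ins for the arrays in the question (shapes reduced for illustration):
clone_embedding_matrix = np.zeros((11272, 50))       # one row per valid token id
training_decoder_input = np.array([[1, 5, 11271],
                                   [2, 3, 4]])       # token ids from the tokenizer

max_id = int(training_decoder_input.max())
vocab_rows = clone_embedding_matrix.shape[0]

# Any id >= vocab_rows triggers exactly the InvalidArgumentError above.
assert max_id < vocab_rows, (
    f"token id {max_id} is out of range for an embedding with {vocab_rows} rows"
)
print("all token ids fit the embedding matrix")
```

Running the same check on the encoder input against `original_embedding_matrix` would pinpoint which of the two embedding layers is being fed out-of-range ids.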
You can also limit the number of words used by the tokenizer in TensorFlow:

tokenizer = Tokenizer(num_words=20000)

and it will keep only the 20,000 most frequent words.
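As a small illustration of that behaviour: note that `num_words` does not shrink `tokenizer.word_index` (it still records every word seen during fitting); it only filters the ids that `texts_to_sequences` emits, keeping ids strictly below `num_words`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the cat ran"]

# Keep only the most frequent words: ids 1 and 2 survive in the output.
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

# word_index still contains every word seen during fitting.
print(len(tokenizer.word_index))  # 6 distinct words in these two sentences

# texts_to_sequences drops any id >= num_words.
seqs = tokenizer.texts_to_sequences(texts)
print(max(max(s) for s in seqs))  # never exceeds num_words - 1, i.e. 2
```

So if you size the embedding matrix from `num_words`, this filtering guarantees no out-of-range ids; if you size it from `len(word_index)`, no filtering is needed.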