Tags: tensorflow, nlp, word2vec, embedding, word-embedding

Tensorflow embeddings InvalidArgumentError: indices[18,16] = 11905 is not in [0, 11905) [[node sequential_1/embedding_1/embedding_lookup


I am using TF 2.2.0 and trying to create a Word2Vec CNN text classification model, but no matter how I approach it there is always an issue with the model or the embedding layer. I could not find a clear solution on the internet, so I decided to ask here.

import multiprocessing
import gensim

# Train a CBOW Word2Vec model (sg=0) with 100-dimensional vectors and save it in text format.
modelW2V = gensim.models.Word2Vec(filtered_stopwords_list, size=100, min_count=5, window=5,
                                  sg=0, iter=10, workers=multiprocessing.cpu_count() - 1)
model_save_location = "3000tweets_notbinary"
modelW2V.wv.save_word2vec_format(model_save_location)

import numpy as np

# Load the saved vectors back into a {word: vector} dictionary.
word2vec = {}
with open('3000tweets_notbinary', encoding='UTF-8') as f:
    next(f)  # skip the "<vocab_size> <dim>" header line that save_word2vec_format writes
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec

num_words = len(tokenizer.word_index)

# Build the embedding matrix: row i holds the vector for the word with tokenizer
# index i; words without a pre-trained Word2Vec vector get a zero vector instead.
embedding_matrix = np.random.uniform(-1, 1, (num_words, 100))
for word, i in tokenizer.word_index.items():
    if i < num_words:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            embedding_matrix[i] = np.zeros((100,))

I created my Word2Vec weights with the code above and then converted them into embedding_matrix, following many tutorials. Since a lot of words in the tokenizer's vocabulary were never seen by Word2Vec, I assign a zero vector whenever no embedding is found. I then fed the data and this embedding matrix to the tf.keras Sequential model below.
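For reference, a quick coverage check (a sketch, assuming tokenizer and word2vec are the objects built above) shows how many tokenizer words actually receive a pre-trained vector rather than the zero-vector fallback:

# Sketch: count tokenizer words that have a pre-trained Word2Vec vector.
covered = sum(1 for w in tokenizer.word_index if w in word2vec)
print(f"{covered} of {len(tokenizer.word_index)} tokenizer words have Word2Vec vectors")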

seq_leng = max_tokens
vocab_size = num_words
embedding_dim = 100
filter_sizes = [3, 4, 5]
num_filters = 512
drop = 0.5
epochs = 5
batch_size = 32

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              weights=[embedding_matrix],
                              input_length=max_tokens,
                              trainable=False),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(2),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(),
    tf.keras.layers.Dropout(drop),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(3, activation="softmax")
])

model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-06),
              metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

model.summary()

history = model.fit(x_train_pad, y_train2, batch_size=60, epochs=epochs, shuffle=True, verbose=1)

But when I run this code, TensorFlow raises the following error at a random point during training, and I could not find any solution to it. I have tried adding +1 to vocab_size, but then I get a shape mismatch error that does not even let me build the model. Can anyone please help me?

InvalidArgumentError:  indices[18,16] = 11905 is not in [0, 11905)
     [[node sequential_1/embedding_1/embedding_lookup (defined at <ipython-input-26-ef1b16cf85bf>:1) ]] [Op:__inference_train_function_1533]

Errors may have originated from an input operation.
Input Source operations connected to node sequential_1/embedding_1/embedding_lookup:
 sequential_1/embedding_1/embedding_lookup/991 (defined at /usr/lib/python3.6/contextlib.py:81)

Function call stack:
train_function
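The message can be read as follows: one of the padded input samples contains the token id 11905, while an Embedding layer with input_dim=11905 only accepts ids 0 through 11904. A quick check along these lines (a sketch, assuming x_train_pad is the padded id matrix passed to fit) makes this visible, since Keras' word_index starts at 1 and its largest index therefore equals num_words:

import numpy as np

# Sketch: the largest token id can be as large as num_words, but the
# Embedding layer only covers ids 0 .. num_words - 1.
print("largest token id:", int(np.max(x_train_pad)))
print("Embedding input_dim:", model.layers[0].input_dim)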

Solution

  • I solved this issue. I had extended vocab_size to vocab_size + 1, as suggested by others, but because the layer dimension and the embedding matrix shape then no longer matched, the error persisted. Adding a zero vector at the end of my embedding matrix, so that its row count also becomes vocab_size + 1, solved the issue.
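A minimal sketch of that fix, assuming embedding_matrix, num_words, and max_tokens come from the code in the question: append one zero row so the matrix has num_words + 1 rows, and pass num_words + 1 as input_dim (the layer below would replace the first layer of the Sequential model above).

import numpy as np

# Sketch of the fix: one extra zero row makes the matrix shape
# (num_words + 1, 100), matching input_dim = num_words + 1, so the
# tokenizer indices 1 .. num_words are all valid lookups.
embedding_matrix = np.vstack([embedding_matrix, np.zeros((1, 100))])

embedding_layer = tf.keras.layers.Embedding(input_dim=num_words + 1,
                                            output_dim=100,
                                            weights=[embedding_matrix],
                                            input_length=max_tokens,
                                            trainable=False)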