I have a model that was kinda working on some data. I've added some tokenized word data to the dataset (somewhat truncated for brevity):
comment_texts = df.comment_text.values
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(comment_texts)
comment_seq = tokenizer.texts_to_sequences(comment_texts)
maxtrainlen = max_length(comment_seq)
comment_train = pad_sequences(comment_seq, maxlen=maxtrainlen, padding='post')
vocab_size = len(tokenizer.word_index) + 1

df.comment_text = comment_train
x = df.drop('label', axis=1)  # the thing I'm training on
labels = df['label'].values  # also known as Y

x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.2, random_state=1337)
n_cols = x_train.shape[1]
embedding_dim = 100  # TODO: why?

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=(n_cols,)),
    LSTM(32),
    Dense(32, activation='relu'),
    Dense(512, activation='relu'),
    Dense(12, activation='softmax'),  # 12 known classes; an unknown type isn't accounted for while training
])
model.summary()
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

# convert y_train to a one-hot encoded variable
encoder = LabelEncoder()
encoder.fit(labels)  # fit on all the labels
encoded_Y = encoder.transform(y_train)  # encode y_train
one_hot_y = np_utils.to_categorical(encoded_Y)

model.fit(x_train, one_hot_y, epochs=10, batch_size=16)
Now, I get this error:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 12, 100) 4040500
_________________________________________________________________
lstm (LSTM) (None, 32) 17024
_________________________________________________________________
dense (Dense) (None, 32) 1056
_________________________________________________________________
dense_1 (Dense) (None, 512) 16896
_________________________________________________________________
dense_2 (Dense) (None, 12) 6156
=================================================================
Total params: 4,081,632
Trainable params: 4,081,632
Non-trainable params: 0
_________________________________________________________________
Train on 4702 samples
Epoch 1/10
2020-03-04 22:37:59.499238: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: indices[0,0] = -4 is not in [0, 40405)
I think this must be coming from my comment_text column since that is the only thing I added.
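One way to sanity-check that (a throwaway diagnostic snippet, not part of the model) would be to compare the value ranges of the token matrix and of the full feature frame, since anything outside [0, vocab_size) will crash the Embedding lookup:

# token ids produced by the tokenizer: should all lie in [0, vocab_size)
print(comment_train.min(), comment_train.max())
# minimum across every numeric column actually fed into the model
print(x_train.select_dtypes('number').to_numpy().min())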
My full code (before I made the change) is here: https://colab.research.google.com/drive/1y8Lhxa_DROZg0at3VR98fi5WCcunUhyc#scrollTo=hpEoqR4ne9TO
You should be training with comment_train, not with x, which is picking up whatever else is in df.
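For example, a minimal sketch of that change (assuming labels lines up row-for-row with comment_train, and reusing your encoding step):

# split the padded token matrix directly, not the mixed DataFrame
x_train, x_test, y_train, y_test = train_test_split(
    comment_train, labels, test_size=0.2, random_state=1337)

# note: the Embedding's input length must now match the padded sequence
# length, i.e. input_shape=(maxtrainlen,) instead of (n_cols,)

encoder = LabelEncoder()
encoder.fit(labels)
one_hot_y = np_utils.to_categorical(encoder.transform(y_train))
model.fit(x_train, one_hot_y, epochs=10, batch_size=16)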
The embedding_dim=100 is free to choose. It's like the number of units in a hidden layer: you can tune it to find what works best for your model, just as you tune the number of units in hidden layers.
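For instance, a rough way to compare a few candidate sizes (the sizes and epoch count here are arbitrary):

# try a few embedding sizes and compare validation accuracy
for embedding_dim in (50, 100, 200):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxtrainlen),
        LSTM(32),
        Dense(12, activation='softmax'),
    ])
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
    history = model.fit(x_train, one_hot_y, epochs=3, batch_size=16,
                        validation_split=0.1, verbose=0)
    print(embedding_dim, history.history['val_acc'][-1])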
In your case, you will need a model with two or more inputs: one branch for the tokenized comment text and another for the remaining features. At some point you will concatenate these two branches and keep on going, as in the sketch below.
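Here is a minimal sketch with the functional API (layer sizes are illustrative, and num_other_features / other_features are placeholders for however you split off the non-text columns):

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

# branch 1: the padded token sequences
text_input = Input(shape=(maxtrainlen,), name='comment_tokens')
text_branch = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(text_input)
text_branch = LSTM(32)(text_branch)

# branch 2: the remaining (non-text) features
other_input = Input(shape=(num_other_features,), name='other_features')
other_branch = Dense(32, activation='relu')(other_input)

# concatenate the two branches and keep on going
merged = Concatenate()([text_branch, other_branch])
merged = Dense(512, activation='relu')(merged)
output = Dense(12, activation='softmax')(merged)

model = Model(inputs=[text_input, other_input], outputs=output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

# fit then takes one array per input:
# model.fit([comment_train, other_features], one_hot_y, epochs=10, batch_size=16)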
This guide has a good tutorial on functional API models, including one with two text inputs and an extra input: https://www.tensorflow.org/guide/keras/functional