Invalid argument: indices[0,0] = -4 is not in [0, 40405)

I have a model that was kind of working on some data. I've now added some tokenized word data to the dataset (code somewhat truncated for brevity):

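comment_texts = df.comment_text.values

tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(comment_texts)  # build the vocabulary before reading word_index
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding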

comment_seq = tokenizer.texts_to_sequences(comment_texts)
maxtrainlen = max_length(comment_seq)
comment_train = pad_sequences(comment_seq, maxlen=maxtrainlen, padding='post')
df.comment_text = comment_train

x = df.drop('label', axis=1)  # the thing I'm training

labels = df['label'].values  # Also known as Y

x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.2, random_state=1337)        

n_cols = x_train.shape[1]

embedding_dim = 100  # TODO: why?

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=(n_cols,)),
    Dense(32, activation='relu'),
    Dense(512, activation='relu'),
    Dense(12, activation='softmax'),  # for an unknown type, we don't account for that while training
])

# convert the y_train to a one hot encoded variable
encoder = LabelEncoder()  # fit on all the labels
encoded_Y = encoder.transform(y_train)  # encode on y_train
one_hot_y = np_utils.to_categorical(encoded_Y)
model.fit(x_train, one_hot_y, epochs=10, batch_size=16)

Now, I get this error:

Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 12, 100)           4040500   
lstm (LSTM)                  (None, 32)                17024     
dense (Dense)                (None, 32)                1056      
dense_1 (Dense)              (None, 512)               16896     
dense_2 (Dense)              (None, 12)                6156      
Total params: 4,081,632
Trainable params: 4,081,632
Non-trainable params: 0
Train on 4702 samples
Epoch 1/10
2020-03-04 22:37:59.499238: W tensorflow/core/common_runtime/] BaseCollectiveExecutor::StartAbort Invalid argument: indices[0,0] = -4 is not in [0, 40405)

I think this must be coming from my comment_text column since that is the only thing I added.

Here is what comment_text looks like before I make the substitution: before

And here is after: after

My full code (before I made the change) is here:


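  • You should be training with comment_train, not with x, which contains every other column of df as well. The Embedding layer only accepts integer token indices in [0, vocab_size), so any other value (such as the -4 in the error) is rejected.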
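
To see which values the Embedding layer would reject, you could scan the input for anything outside [0, vocab_size). This is only a minimal sketch in plain Python: find_invalid_indices and the sample rows are made up for illustration, and the vocabulary size is taken from the error message.

```python
def find_invalid_indices(rows, vocab_size):
    """Return (row, col, value) for every entry outside [0, vocab_size)."""
    bad = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not 0 <= value < vocab_size:
                bad.append((r, c, value))
    return bad


# Hypothetical input: the first row starts with a value that is not a token index,
# as would happen if a non-text column ended up in the Embedding input.
sample = [[-4, 17, 3], [5, 0, 40404]]
print(find_invalid_indices(sample, vocab_size=40405))  # [(0, 0, -4)]
```

Running the same check on x_train versus comment_train should show whether the offending values come from the extra columns.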

    The embedding_dim=100 is yours to choose. It works like the number of units in a hidden layer: you can tune it to find what is best for your model, just as you can tune the number of units in the hidden layers.

    In your case, you will need a model with two or more inputs:

    • One input for the comments, passing through the embedding and processing text
    • Another input for the rest of the data, probably passing through a standard network.

    At some point you will concatenate these two branches and keep on going.
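
A rough sketch of such a two-branch model with the Keras functional API could look like the following. The sizes max_len and n_other are assumptions chosen for illustration (they are not taken from your data), vocab_size comes from the error message, and the LSTM(32) and Dense sizes mirror your model summary.

```python
import numpy as np
from tensorflow.keras import layers, Model

vocab_size = 40405  # from the error message: valid indices are [0, 40405)
max_len = 12        # assumed padded comment length (hypothetical)
n_other = 11        # assumed number of non-text feature columns (hypothetical)
n_classes = 12      # matches Dense(12, activation='softmax') in the question

# Branch 1: padded token indices -> embedding -> LSTM
text_in = layers.Input(shape=(max_len,), name="comment")
t = layers.Embedding(input_dim=vocab_size, output_dim=100)(text_in)
t = layers.LSTM(32)(t)

# Branch 2: the remaining columns -> a standard dense layer
other_in = layers.Input(shape=(n_other,), name="other")
o = layers.Dense(32, activation="relu")(other_in)

# Concatenate the two branches and keep on going
merged = layers.concatenate([t, o])
h = layers.Dense(512, activation="relu")(merged)
out = layers.Dense(n_classes, activation="softmax")(h)

model = Model(inputs=[text_in, other_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Training would then take a list with one array per input, for example model.fit([comment_part, other_part], one_hot_y, epochs=10, batch_size=16), where comment_part holds the padded sequences and other_part the remaining columns for the same rows (both names are placeholders).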

