I have a model that was kinda working on some data. I've added some tokenized word data to the dataset (somewhat truncated for brevity):
comment_texts = df.comment_text.values
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(comment_texts)
comment_seq = tokenizer.texts_to_sequences(comment_texts)
maxtrainlen = max_length(comment_seq)
comment_train = pad_sequences(comment_seq, maxlen=maxtrainlen, padding='post')
vocab_size = len(tokenizer.word_index) + 1

df.comment_text = comment_train
x = df.drop('label', axis=1)  # the thing I'm training on
labels = df['label'].values  # also known as Y

x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.2, random_state=1337)
n_cols = x_train.shape[1]
embedding_dim = 100  # TODO: why?

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=(n_cols,)),
    LSTM(32),
    Dense(32, activation='relu'),
    Dense(512, activation='relu'),
    Dense(12, activation='softmax'),  # 12 known classes; an unknown type isn't accounted for while training
])
model.summary()
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

# convert y_train to a one-hot encoded variable
encoder = LabelEncoder()
encoder.fit(labels)  # fit on all the labels
encoded_Y = encoder.transform(y_train)  # encode y_train
one_hot_y = np_utils.to_categorical(encoded_Y)

model.fit(x_train, one_hot_y, epochs=10, batch_size=16)
Now, I get this error:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 12, 100) 4040500
_________________________________________________________________
lstm (LSTM) (None, 32) 17024
_________________________________________________________________
dense (Dense) (None, 32) 1056
_________________________________________________________________
dense_1 (Dense) (None, 512) 16896
_________________________________________________________________
dense_2 (Dense) (None, 12) 6156
=================================================================
Total params: 4,081,632
Trainable params: 4,081,632
Non-trainable params: 0
_________________________________________________________________
Train on 4702 samples
Epoch 1/10
2020-03-04 22:37:59.499238: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: indices[0,0] = -4 is not in [0, 40405)
I think this must be coming from my comment_text column since that is the only thing I added.
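One way to sanity-check that (a throwaway diagnostic snippet, not part of the model) would be to compare the value ranges of the token matrix and of the full feature frame, since anything outside [0, vocab_size) will crash the Embedding lookup:

# token ids produced by the tokenizer: should all lie in [0, vocab_size)
print(comment_train.min(), comment_train.max())
# minimum across every numeric column actually fed into the model
print(x_train.select_dtypes('number').to_numpy().min())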
My full code (before I made the change) is here: https://colab.research.google.com/drive/1y8Lhxa_DROZg0at3VR98fi5WCcunUhyc#scrollTo=hpEoqR4ne9TO
You should be training with comment_train, not with x, which is picking up whatever else is in df.
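For example, a minimal sketch of that change (assuming labels lines up row-for-row with comment_train, and reusing your encoding step):

# split the padded token matrix directly, not the mixed DataFrame
x_train, x_test, y_train, y_test = train_test_split(
    comment_train, labels, test_size=0.2, random_state=1337)

# note: the Embedding's input length must now match the padded sequence
# length, i.e. input_shape=(maxtrainlen,) instead of (n_cols,)

encoder = LabelEncoder()
encoder.fit(labels)
one_hot_y = np_utils.to_categorical(encoder.transform(y_train))
model.fit(x_train, one_hot_y, epochs=10, batch_size=16)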
The embedding_dim=100 is free to choose. It's like the number of units in a hidden layer: you can tune it to find what works best for your model, just as you tune the number of units in hidden layers.
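For instance, a rough way to compare a few candidate sizes (the sizes and epoch count here are arbitrary):

# try a few embedding sizes and compare validation accuracy
for embedding_dim in (50, 100, 200):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxtrainlen),
        LSTM(32),
        Dense(12, activation='softmax'),
    ])
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
    history = model.fit(x_train, one_hot_y, epochs=3, batch_size=16,
                        validation_split=0.1, verbose=0)
    print(embedding_dim, history.history['val_acc'][-1])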
In your case, you will need a model with two or more inputs: one branch for the tokenized comment text and another for the remaining features. At some point you will concatenate these two branches and keep on going, as in the sketch below.
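Here is a minimal sketch with the functional API (layer sizes are illustrative, and num_other_features / other_features are placeholders for however you split off the non-text columns):

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

# branch 1: the padded token sequences
text_input = Input(shape=(maxtrainlen,), name='comment_tokens')
text_branch = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(text_input)
text_branch = LSTM(32)(text_branch)

# branch 2: the remaining (non-text) features
other_input = Input(shape=(num_other_features,), name='other_features')
other_branch = Dense(32, activation='relu')(other_input)

# concatenate the two branches and keep on going
merged = Concatenate()([text_branch, other_branch])
merged = Dense(512, activation='relu')(merged)
output = Dense(12, activation='softmax')(merged)

model = Model(inputs=[text_input, other_input], outputs=output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

# fit then takes one array per input:
# model.fit([comment_train, other_features], one_hot_y, epochs=10, batch_size=16)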
This guide has a good tutorial on functional API models, including one with two text inputs and an extra input: https://www.tensorflow.org/guide/keras/functional