Tags: python, machine-learning, keras, lstm, gensim

LSTM Model Validation Accuracy not following Training Accuracy


I'm building an LSTM model to classify homophobic tweets in PT-BR. I have a dataset of 5k tweets, already balanced between homophobic and non-homophobic. The thing is, I've already tested three different models, and in all of them the validation accuracy/loss does not follow the training accuracy/loss. I would like to know whether I'm doing something wrong, or whether this could be a problem with the dataset, which does not follow formal writing.

Params

dim = 300
epochs = 100
lstm = 150
window = 7

Word2Vec

import gensim

# 300-dimensional vectors, context window of 7, ignore words seen fewer than 10 times
w2v_model = gensim.models.word2vec.Word2Vec(vector_size=dim,
                                            window=window,
                                            min_count=10,
                                            workers=8)

Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, LSTM, Dense

model = Sequential()
model.add(embedding_layer)  # embedding_layer is defined elsewhere, not shown in the question
model.add(Dropout(0.5))
model.add(LSTM(lstm, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))  # binary classification output
model.summary()
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

Training Example

model_history = model.fit(X_train, y_train, epochs=epochs, validation_split=0.1, verbose=1)

Results

(Accuracy and loss plots over training epochs; the validation curves do not track the training curves.)
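For reference, a minimal sketch of how these curves can be plotted from model_history, assuming the TF2-style Keras history keys 'accuracy', 'val_accuracy', 'loss', and 'val_loss':

import matplotlib.pyplot as plt

# Plot training vs. validation curves from the History object returned by fit().
hist = model_history.history
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

ax_acc.plot(hist['accuracy'], label='train')
ax_acc.plot(hist['val_accuracy'], label='validation')
ax_acc.set_title('Accuracy')
ax_acc.legend()

ax_loss.plot(hist['loss'], label='train')
ax_loss.plot(hist['val_loss'], label='validation')
ax_loss.set_title('Loss')
ax_loss.legend()

plt.show()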


Solution

  • It's unclear why you're instantiating a Gensim Word2Vec model, as it isn't shown to be trained or used by your other code.

    But: if you have a mere 5000 tweets, often averaging fewer than 20 words per tweet, then you've only got about 100,000 training words. That's far too little to train a Word2Vec model from scratch, much less one with a full 300 dimensions per word. (For how such a model is usually trained and wired into the network, see the first sketch at the end of this answer.)

    Most generally, when a model shows good performance (like high accuracy) on the data used for training, but then poor performance on held-out validation data, one common cause is 'overfitting'.

    Essentially, the model is memorizing: it learns to give answers based on details in your training data that are, in generalizable truth, irrelevant, but that nonetheless turn out to have memorizable yet unhelpful correlations with the desired answers strictly within the limited training data. Trying to apply those same misleading correlations to out-of-training data ruins model performance.

    Far more data often helps: the nonsense correlations may cancel out. Using a smaller model, one that's forced to notice only stronger (and thus potentially more reliable) correlations, may also help; the second sketch below illustrates that idea.
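    As a concrete illustration of the points above, here is a minimal, hedged sketch of how a Gensim Word2Vec model is typically trained on the tweet corpus and wired into the Keras embedding_layer. The names tokenized_tweets and tokenizer are assumptions (a list of token lists and a Keras Tokenizer already fitted on the same tweets), and the reduced vector size is illustrative of the small-corpus concern, not a tuned value.

    import numpy as np
    import gensim
    from tensorflow.keras.layers import Embedding

    small_dim = 100  # smaller than 300: more realistic for ~100k training words

    # Hypothetical inputs: tokenized_tweets is a list of token lists,
    # tokenizer is a Keras Tokenizer already fitted on the same tweets.
    w2v_model = gensim.models.word2vec.Word2Vec(vector_size=small_dim,
                                                window=window,
                                                min_count=10,
                                                workers=8)
    w2v_model.build_vocab(tokenized_tweets)
    w2v_model.train(tokenized_tweets,
                    total_examples=w2v_model.corpus_count,
                    epochs=w2v_model.epochs)

    # Copy the learned vectors into a matrix indexed by the Keras tokenizer,
    # then freeze them in an Embedding layer.
    vocab_size = len(tokenizer.word_index) + 1
    embedding_matrix = np.zeros((vocab_size, small_dim))
    for word, idx in tokenizer.word_index.items():
        if word in w2v_model.wv:
            embedding_matrix[idx] = w2v_model.wv[word]

    embedding_layer = Embedding(vocab_size, small_dim,
                                weights=[embedding_matrix],
                                trainable=False)

    Whether to freeze the vectors (trainable=False) or fine-tune them is a separate choice; freezing at least avoids adding more trainable parameters on top of so little data.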
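    And a second sketch of the "smaller model" idea: the same architecture, but with far fewer LSTM units, so there are fewer parameters available for memorizing training-set quirks. The unit count here is an illustrative assumption, not a tuned value.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dropout, LSTM, Dense

    small_model = Sequential()
    small_model.add(embedding_layer)
    small_model.add(Dropout(0.5))
    small_model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))  # 32 units instead of 150
    small_model.add(Dense(1, activation='sigmoid'))
    small_model.compile(loss='binary_crossentropy',
                        optimizer='adam',
                        metrics=['accuracy'])

    history = small_model.fit(X_train, y_train,
                              epochs=epochs,
                              validation_split=0.1,
                              verbose=1)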