I'm building an LSTM model to classify homophobic tweets in PT-BR. I have a dataset of 5k tweets, already balanced between homophobic and non-homophobic. The problem is that I've tested three different models, and in all of them the validation accuracy/loss does not follow the training accuracy/loss. I would like to know whether I'm doing something wrong, or whether this could be a dataset problem, since the tweets don't follow formal writing.
dim = 300
epochs = 100
lstm = 150
window = 7
w2v_model = gensim.models.word2vec.Word2Vec(vector_size=dim,
model = Sequential()
model.add(LSTM(lstm, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model_history = model.fit(X_train, y_train, epochs=epochs, validation_split=0.1, verbose=1)
It's unclear why you're instantiating a Gensim Word2Vec model, as it isn't shown to be trained or used by your other code.
But: if you have a mere 5000 tweets – often averaging fewer than 20 words per tweet – then you've only got about 100,000 training words. That's way too little to train a Word2Vec model from scratch, much less one with a full 300 dimensions per word.
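To make that scale mismatch concrete, here is a rough back-of-the-envelope sketch. The tweet count, words per tweet, and vector size come from the discussion above; `vocab_size` is a purely hypothetical figure, since the real vocabulary size isn't given. It also assumes the usual layout of two weight matrices (word vectors plus output-layer weights).

```python
# Rough scale check: training words available vs. Word2Vec weights to fit.
n_tweets = 5_000          # from the question
words_per_tweet = 20      # generous upper estimate from the text above
dim = 300                 # vector_size in the question's code
vocab_size = 10_000       # HYPOTHETICAL: the real vocabulary size is not given

training_words = n_tweets * words_per_tweet

# Assumed layout: two vocab_size x dim weight matrices,
# the word vectors themselves plus the output-layer weights.
w2v_weights = 2 * vocab_size * dim

print(f"training words:   {training_words:,}")
print(f"Word2Vec weights: {w2v_weights:,}")
print(f"weights per training word: {w2v_weights / training_words:.0f}")
```

Under these assumptions there are dozens of adjustable weights for every word of training text, which is why such a model can't learn reliable vectors from this little data.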
Most generally, when a model shows good performance (like high accuracy) on the data used for training, but then poor performance on held-out validation data, one common cause is 'overfitting'.
Essentially, the model is learning to give answers based on details in your training data that are, in generalizable truth, irrelevant – but nonetheless turn out to have memorizable but unhelpful correlations with the desired answers strictly within the limited training data. Trying to apply those same misleading correlations on out-of-training data ruins model performance.
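As a toy illustration of that effect (not your model, and not real tweets), the sketch below builds fake "tweets" from random token ids with coin-flip labels, so there is nothing genuine to learn. A simple per-token vote, invented here purely for the demonstration, still latches onto the accidental token/label pairings in the training set.

```python
import random

random.seed(0)
VOCAB, TOKENS_PER_TWEET = 5_000, 10

def fake_tweet():
    """Random token ids with a coin-flip label: there is no real signal to learn."""
    tokens = [random.randrange(VOCAB) for _ in range(TOKENS_PER_TWEET)]
    return tokens, random.randint(0, 1)

train = [fake_tweet() for _ in range(200)]
val = [fake_tweet() for _ in range(200)]

# "Training": each token accumulates +1/-1 votes from the labels
# of the training tweets it appeared in.
token_score = {}
for tokens, label in train:
    for tok in tokens:
        token_score[tok] = token_score.get(tok, 0) + (1 if label == 1 else -1)

def predict(tokens):
    # Unseen tokens contribute nothing; ties fall back to class 1.
    return 1 if sum(token_score.get(tok, 0) for tok in tokens) >= 0 else 0

def accuracy(data):
    return sum(predict(tokens) == label for tokens, label in data) / len(data)

train_acc, val_acc = accuracy(train), accuracy(val)
print(f"train accuracy: {train_acc:.2f}")
print(f"val accuracy:   {val_acc:.2f}")
```

Training accuracy should come out near perfect while validation accuracy hovers around chance, because every correlation the "model" picked up was an accident of the training sample.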
Far more data often helps: the nonsense correlations may cancel out. Using a smaller model that's forced to only notice stronger (& thus potentially more-reliable) correlations may also help.
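For a sense of what "smaller" could mean here, this sketch counts the trainable weights in a standard LSTM layer (four gates, with biases). It assumes 300-dimensional word vectors feed the LSTM, which the posted code doesn't actually show, and the 32-unit alternative is just an illustrative value, not a recommendation tuned to your data.

```python
def lstm_param_count(units, input_dim):
    """Trainable weights in one standard LSTM layer (4 gates, with biases)."""
    return 4 * (units * (units + input_dim) + units)

input_dim = 300   # ASSUMPTION: 300-dim word vectors feed the LSTM (dim = 300 above)

current = lstm_param_count(150, input_dim)   # lstm = 150 in the question
smaller = lstm_param_count(32, input_dim)    # HYPOTHETICAL smaller setting

print(f"LSTM(150): {current:,} weights")
print(f"LSTM(32):  {smaller:,} weights")
```

With only a few thousand training tweets, the 150-unit layer has far more weights than examples to constrain them, so trying a much smaller layer is a cheap experiment.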