I am currently training a Word2Vec + CNN model for Twitter sentiment analysis in the COVID-19 vaccine domain, using the pre-trained GoogleNewsVectorNegative300 word embedding. The problem is that the model overfits heavily during training. I chose the pre-trained GoogleNewsVectorNegative300 because performance was much worse when I trained my own Word2Vec on my own dataset. Here are the steps I carried out before fitting the model:
Text pre-processing:
I split my dataset 90:10 into train and test sets as follows:
from sklearn.model_selection import train_test_split

def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                         y,
                                                         train_size=0.9,
                                                         test_size=0.1,
                                                         stratify=y,
                                                         random_state=0)
    return X_train, X_test, y_train, y_test
The split results in a training set of 2060 samples: 708 positive, 837 negative, and 515 neutral.
Then I applied EDA (Easy Data Augmentation) text augmentation to all of the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swapped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swapped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
After augmentation, the training set has 10300 samples: 3540 positive, 4185 negative, and 2575 neutral.
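Each training sample is kept and extended with its four EDA variants (which matches the 2060 to 10300 expansion); the application loop itself is omitted here, but a simplified sketch of it, using the variable names from the tokenization code below, looks like this:
import pandas as pd

augmented_texts, augmented_labels = [], []
for text, label in zip(df_pfizer_train['text'], df_pfizer_train['sentiment']):
    variants = [
        text,                                        # keep the original sample
        text_augmentation.replace_synonym(text),
        text_augmentation.random_insert(text),
        text_augmentation.random_swap(text),
        text_augmentation.random_delete(text),
    ]
    augmented_texts.extend(variants)
    augmented_labels.extend([label] * len(variants))

df_pfizer_train = pd.DataFrame({'text': augmented_texts, 'sentiment': augmented_labels})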
Then I tokenized the sequences as follows:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Tokenize the sequences
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)

# Pad the sequences
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1

# Encode labels (reuse the train mapping for the test set so the codes stay consistent)
y_pfizer_train_encoded, sentiment_classes = df_pfizer_train['sentiment'].factorize()
y_pfizer_test_encoded = sentiment_classes.get_indexer(df_pfizer_test['sentiment'])
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
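The code that builds the embedding matrix fed into the model is omitted here; a simplified sketch of that step, assuming gensim's KeyedVectors and a local copy of the GoogleNews .bin file (the path is a placeholder):
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained vectors (the .bin path is a placeholder).
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                             binary=True)

embedding_dim = 300
pfizer_embedding_matrix = np.zeros((pfizer_num_words, embedding_dim))
for word, index in pfizer_tokenizer.word_index.items():
    if word in word2vec:                     # out-of-vocabulary words keep an all-zero row
        pfizer_embedding_matrix[index] = word2vec[word]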
Finally, I fit the data into my model, using the pre-trained GoogleNewsVectorNegative300 word embedding and a CNN, and split off another 10% of the training data for validation as follows:
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Dropout, BatchNormalization, Flatten, Dense)
from tensorflow.keras.models import Model

# Build single CNN model
def build_cnn_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,))

    # Word embedding layer initialized with the pre-trained vectors
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)

    # CNN layers: three Conv1D -> MaxPooling -> Dropout -> BatchNorm blocks
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(embedding_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    flatten = Flatten()(batch_norm_layer)

    # Dense layers
    dense_layer = Dense(units=10, activation='relu')(flatten)
    batch_norm_layer = BatchNormalization()(dense_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_norm_layer)

    cnn_model = Model(inputs=input_layer, outputs=output_layer)
    return cnn_model
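# The compile step is omitted above; something along these lines is assumed
# before fit() (the optimizer/loss and the sinovac_* inputs mirror the names
# in the fit call below and are placeholders, not code from the question):
sinovac_cnn_model = build_cnn_model(sinovac_embedding_matrix, sinovac_max_length)
sinovac_cnn_model.compile(optimizer='adam',
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])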
sinovac_cnn_history = sinovac_cnn_model.fit(x=X_sinovac_train,
                                            y=y_sinovac_train,
                                            batch_size=128,
                                            epochs=100,
                                            validation_split=0.1,
                                            verbose=1)
I would really appreciate any suggestions or insights, because I have been at this for a while without any improvement in the model's performance.
That's quite a complex problem. It does look like overfitting, as you said yourself, meaning the model can't generalize well from your training set to new data.
Intuitively, I would suggest cycling through hyperparameters (epochs, batch size, learning rate, dropout rates), if you haven't already, to look for a better combination. I would also suggest using cross-validation to get a better idea of your classifier's performance; this also shuffles the training data and keeps the model from learning it by heart.
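For example, a stratified k-fold loop around your existing builder could look roughly like this (just a sketch: pfizer_embedding_matrix stands for whatever matrix you pass to build_cnn_model, and the fold/epoch counts are arbitrary):
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X_pfizer_train_padded, y_pfizer_train_encoded):
    model = build_cnn_model(pfizer_embedding_matrix, pfizer_max_length)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_pfizer_train_padded[train_idx], y_pfizer_train_category[train_idx],
              validation_data=(X_pfizer_train_padded[val_idx],
                               y_pfizer_train_category[val_idx]),
              batch_size=128, epochs=20, verbose=0)
    _, val_acc = model.evaluate(X_pfizer_train_padded[val_idx],
                                y_pfizer_train_category[val_idx], verbose=0)
    fold_scores.append(val_acc)

print('Mean CV accuracy:', np.mean(fold_scores))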
Have you tried classifying the original data without the augmentation? It's not a lot of data, but it could be enough to check whether test-set performance is better than in the final version, and thus whether the augmentation is corrupting something in your data.
Have you tried another embedding? I don't really think this is the problem, but while hunting for the error I would swap it out to see what happens.
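If you want to test that quickly, gensim's downloader ships ready-made alternatives such as GloVe vectors trained on tweets; a rough sketch that reuses your tokenizer and builder (the model name comes from gensim-data, everything else is illustrative):
import numpy as np
import gensim.downloader as api

glove = api.load('glove-twitter-200')        # 200-dim GloVe vectors trained on tweets
glove_matrix = np.zeros((pfizer_num_words, glove.vector_size))
for word, index in pfizer_tokenizer.word_index.items():
    if word in glove:
        glove_matrix[index] = glove[word]

glove_cnn_model = build_cnn_model(glove_matrix, pfizer_max_length)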
Last but not least, do you know for a fact that this model structure can handle the task, i.e. did you find a working example somewhere? It sounds like it should, but there is a chance that the CNN simply doesn't generalize well over the embeddings. Have you considered a model more commonly used for text classification, such as a Transformer or an LSTM?
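If you want to try an LSTM, a minimal bidirectional variant that reuses your embedding matrix could look like this (just a sketch; the layer sizes are arbitrary starting points, not tuned values):
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.models import Model

def build_lstm_model(embedding_matrix, max_sequence_length):
    input_layer = Input(shape=(max_sequence_length,))
    x = Embedding(input_dim=embedding_matrix.shape[0],
                  output_dim=embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  trainable=False)(input_layer)   # frozen embeddings often overfit less on small data
    x = Bidirectional(LSTM(64))(x)
    x = Dropout(0.5)(x)
    x = Dense(32, activation='relu')(x)
    output_layer = Dense(3, activation='softmax')(x)
    return Model(inputs=input_layer, outputs=output_layer)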