I am currently training a Word2Vec + CNN model for Twitter sentiment analysis in the COVID-19 vaccine domain, using the pre-trained GoogleNewsVectorNegative300 word embedding. The problem is that the model overfits heavily during training. I chose the pre-trained GoogleNewsVectorNegative300 because performance was much worse when I trained my own Word2Vec on my own dataset. Here are the steps I carried out before fitting the model:
Text pre-processing:
I split my dataset 90:10 into train and test sets as follows:
from sklearn.model_selection import train_test_split

def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                         y,
                                                         train_size=0.9,
                                                         test_size=0.1,
                                                         stratify=y,
                                                         random_state=0)
    return X_train, X_test, y_train, y_test
The split results in a training set of 2060 samples: 708 positive, 837 negative, and 515 neutral.
Then I applied EDA (Easy Data Augmentation) text augmentation to all of the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swapped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swapped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
After augmentation, the training set has 10300 samples: 3540 positive, 4185 negative, and 2575 neutral.
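Each training sample is kept and extended with its four EDA variants (which matches the 2060 to 10300 expansion); the application loop itself is omitted here, but a simplified sketch of it, using the variable names from the tokenization code below, looks like this:
import pandas as pd

augmented_texts, augmented_labels = [], []
for text, label in zip(df_pfizer_train['text'], df_pfizer_train['sentiment']):
    variants = [
        text,                                        # keep the original sample
        text_augmentation.replace_synonym(text),
        text_augmentation.random_insert(text),
        text_augmentation.random_swap(text),
        text_augmentation.random_delete(text),
    ]
    augmented_texts.extend(variants)
    augmented_labels.extend([label] * len(variants))

df_pfizer_train = pd.DataFrame({'text': augmented_texts, 'sentiment': augmented_labels})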
Then I tokenized the sequences as follows:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Tokenize the sequences
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)

# Pad the sequences
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1

# Encode labels (reuse the train mapping for the test set so the codes stay consistent)
y_pfizer_train_encoded, sentiment_classes = df_pfizer_train['sentiment'].factorize()
y_pfizer_test_encoded = sentiment_classes.get_indexer(df_pfizer_test['sentiment'])
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
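The code that builds the embedding matrix fed into the model is omitted here; a simplified sketch of that step, assuming gensim's KeyedVectors and a local copy of the GoogleNews .bin file (the path is a placeholder):
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained vectors (the .bin path is a placeholder).
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                             binary=True)

embedding_dim = 300
pfizer_embedding_matrix = np.zeros((pfizer_num_words, embedding_dim))
for word, index in pfizer_tokenizer.word_index.items():
    if word in word2vec:                     # out-of-vocabulary words keep an all-zero row
        pfizer_embedding_matrix[index] = word2vec[word]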
Finally, I fit the data into my model, using the pre-trained GoogleNewsVectorNegative300 word embedding and a CNN, and split off another 10% of the training data for validation as follows:
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Dropout, BatchNormalization, Flatten, Dense)
from tensorflow.keras.models import Model

# Build single CNN model
def build_cnn_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,))

    # Word embedding layer initialized with the pre-trained vectors
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)

    # CNN layers: three Conv1D -> MaxPooling -> Dropout -> BatchNorm blocks
    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(embedding_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    cnn_layer = Conv1D(filters=256,
                       kernel_size=2,
                       strides=1,
                       padding='valid',
                       activation='relu')(batch_norm_layer)
    cnn_layer = MaxPooling1D(pool_size=2)(cnn_layer)
    cnn_layer = Dropout(rate=0.5)(cnn_layer)
    batch_norm_layer = BatchNormalization()(cnn_layer)

    flatten = Flatten()(batch_norm_layer)

    # Dense layers
    dense_layer = Dense(units=10, activation='relu')(flatten)
    batch_norm_layer = BatchNormalization()(dense_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_norm_layer)

    cnn_model = Model(inputs=input_layer, outputs=output_layer)
    return cnn_model
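# The compile step is omitted above; something along these lines is assumed
# before fit() (the optimizer/loss and the sinovac_* inputs mirror the names
# in the fit call below and are placeholders, not code from the question):
sinovac_cnn_model = build_cnn_model(sinovac_embedding_matrix, sinovac_max_length)
sinovac_cnn_model.compile(optimizer='adam',
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])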
sinovac_cnn_history = sinovac_cnn_model.fit(x=X_sinovac_train,
                                            y=y_sinovac_train,
                                            batch_size=128,
                                            epochs=100,
                                            validation_split=0.1,
                                            verbose=1)
I would really appreciate any suggestions or insights, because I have been at this for a while without any improvement in the model's performance.
That's quite a complex problem. It does look like overfitting, as you said yourself, meaning the model can't generalize well from your training set to new data.
Intuitively, I would suggest cycling through hyperparameters (epochs, batch size, learning rate, dropout rates), if you haven't already, to look for a better combination. I would also suggest using cross-validation to get a better idea of your classifier's performance; this also shuffles the training data and keeps the model from learning it by heart.
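For example, a stratified k-fold loop around your existing builder could look roughly like this (just a sketch: pfizer_embedding_matrix stands for whatever matrix you pass to build_cnn_model, and the fold/epoch counts are arbitrary):
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X_pfizer_train_padded, y_pfizer_train_encoded):
    model = build_cnn_model(pfizer_embedding_matrix, pfizer_max_length)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_pfizer_train_padded[train_idx], y_pfizer_train_category[train_idx],
              validation_data=(X_pfizer_train_padded[val_idx],
                               y_pfizer_train_category[val_idx]),
              batch_size=128, epochs=20, verbose=0)
    _, val_acc = model.evaluate(X_pfizer_train_padded[val_idx],
                                y_pfizer_train_category[val_idx], verbose=0)
    fold_scores.append(val_acc)

print('Mean CV accuracy:', np.mean(fold_scores))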
Have you tried classifying the original data without the augmentation? It's not a lot of data, but it could be enough to check whether test-set performance is better than in the final version, and thus whether the augmentation is corrupting something in your data.
Have you tried another embedding? I don't really think this is the problem, but while hunting for the error I would swap it out to see what happens.
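If you want to test that quickly, gensim's downloader ships ready-made alternatives such as GloVe vectors trained on tweets; a rough sketch that reuses your tokenizer and builder (the model name comes from gensim-data, everything else is illustrative):
import numpy as np
import gensim.downloader as api

glove = api.load('glove-twitter-200')        # 200-dim GloVe vectors trained on tweets
glove_matrix = np.zeros((pfizer_num_words, glove.vector_size))
for word, index in pfizer_tokenizer.word_index.items():
    if word in glove:
        glove_matrix[index] = glove[word]

glove_cnn_model = build_cnn_model(glove_matrix, pfizer_max_length)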
Last but not least, do you know for a fact that this model structure can handle the task, i.e. did you find a working example somewhere? It sounds like it should, but there is a chance that the CNN simply doesn't generalize well over the embeddings. Have you considered a model more commonly used for text classification, such as a Transformer or an LSTM?
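If you want to try an LSTM, a minimal bidirectional variant that reuses your embedding matrix could look like this (just a sketch; the layer sizes are arbitrary starting points, not tuned values):
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.models import Model

def build_lstm_model(embedding_matrix, max_sequence_length):
    input_layer = Input(shape=(max_sequence_length,))
    x = Embedding(input_dim=embedding_matrix.shape[0],
                  output_dim=embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  trainable=False)(input_layer)   # frozen embeddings often overfit less on small data
    x = Bidirectional(LSTM(64))(x)
    x = Dropout(0.5)(x)
    x = Dense(32, activation='relu')(x)
    output_layer = Dense(3, activation='softmax')(x)
    return Model(inputs=input_layer, outputs=output_layer)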