Tags: python, tensorflow, machine-learning, keras, lstm

Validation accuracy is much lower than training accuracy


I am using the MOSI dataset for multimodal sentiment analysis; for now I am training the model on the text data only. For the text, I am using 300-dimensional GloVe embeddings. My total vocab size is 2173 and my padded sequence length is 30. My target array is [0,0,0,0,0,0,1], where the left-most position is highly -ve and the right-most is highly +ve.
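For reference, here is a minimal sketch (assumed, not taken from the original code) of how such one-hot targets are typically built with Keras' to_categorical; the example labels are purely illustrative:

from tensorflow.keras.utils import to_categorical

# Integer classes 0 (highly -ve) through 6 (highly +ve), one-hot encoded.
labels = [6, 0, 3]  # illustrative values only
y7 = to_categorical(labels, num_classes=7)
# y7[0] -> [0., 0., 0., 0., 0., 0., 1.]  (highly +ve, right-most position)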

I am splitting the dataset like this:

X_train, X_test, y_train, y_test = train_test_split(WDatasetX, y7, test_size=0.20, random_state=42)

My tokenization process is:

MAX_NB_WORDS = 3000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS,oov_token = "OOV")
tokenizer.fit_on_texts(Text_X_Train)
tokenized_X_train = tokenizer.texts_to_sequences(Text_X_Train)
tokenized_X_test = tokenizer.texts_to_sequences(Text_X_Test)
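
The padding step to length 30 (mentioned above but not shown in the post) presumably looks something like this; padding='post' is an assumption:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad/truncate every tokenized sentence to the stated sequence length of 30.
sequence_length = 30
padded_X_train = pad_sequences(tokenized_X_train, maxlen=sequence_length, padding='post')
padded_X_test = pad_sequences(tokenized_X_test, maxlen=sequence_length, padding='post')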

My embedding matrix:

vocab_size = len(tokenizer.word_index) + 1
emb_mean = 0

def embedding_matrix_filteration():
    all_embs = np.stack(list(embeddings_index.values()))
    print(all_embs.shape)
    emb_mean, emb_std = np.mean(all_embs), np.std(all_embs)
    print(emb_mean)
    # Matrix of the specified size, filled with values drawn from a Gaussian
    # distribution with the same mean/std as the GloVe vectors.
    embedding_matrix = np.random.normal(emb_mean, emb_std, (vocab_size, embed_dim))
    print(embedding_matrix.shape)
    print("length of word2id:", len(word2id))
    embeddedCount = 0
    not_found = []
    for word, idx in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word.lower())
        if word == ' ':
            embedding_vector = np.zeros(embed_dim)  # all-zero vector for whitespace
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
            embeddedCount += 1
        else:
            not_found.append(word)  # keep track of words missing from GloVe
            print(word)
            print("$$$")
    print('total embedded:', embeddedCount, 'common words')  # words common to the GloVe vocabulary and our word set
    print("length of word2id:", len(word2id))
    print(len(embedding_matrix))
    return embedding_matrix

emb = embedding_matrix_filteration()

Model Architecture:

Embedding Layer:

embedding_layer = Embedding(
    vocab_size,
    300,
    weights=[emb],
    trainable=False,
    input_length=sequence_length
)

My model:

from keras import regularizers,layers

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(512,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256,return_sequences=True)))
model.add(Bidirectional(layers.LSTM(256)))#kernel_regularizer=regularizers.l2(0.001)
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))

For some reason, when my training accuracy reaches 80%, the validation accuracy still remains very low. I have tried different regularization techniques, optimizers, and loss functions, but the result is the same. I don't know why.


Please help!

Edit: The total no. of tokens is 2719 and the total no. of sentences (including the test and train datasets) is 2183.

Compile step:

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['accuracy'])

UPDATED STATS:

I have decreased the label size from 7 to 3, i.e. the one-hot positions of [0,1,0] now correspond to +ve, neutral, and -ve.

model = Sequential()
model.add(embedding_layer)
model.add(Bidirectional(layers.LSTM(16,activation='relu'))) 
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))

Compiled:

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00005),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Graphs: [accuracy and loss curves omitted]

Training: [training log screenshot omitted]

But the loss is still high. Also, I have stratified the dataset.


Solution

  • A couple of recommendations:

    1. Use categorical_crossentropy instead of mean_squared_error; it helps a lot when doing classification (MSE can technically work too, but cross-entropy does the job better). A combined sketch of recommendations 1-7 appears after this list.
    2. Are all your labels mutually exclusive? If so, use softmax + categorical_crossentropy; otherwise (e.g. if a label can look like [1,0,0,0,0,0,1]) use sigmoid + binary_crossentropy.
    3. Decrease the size of the model initially, and only if the overfitting problem persists add Dropout(). Use only one LSTM layer.
    4. Reduce the number of units (even with a single LSTM layer, 64/128 units would probably suffice).
    5. You can use a bidirectional LSTM (I would even opt for bidirectional GRUs, since they are simpler, to see how the performance behaves).
    6. Ensure that you do a stratified split (that way, examples of every class appear in both the training set and the validation set, in good proportion).
    7. Start with a small(er) learning rate (0.0001/0.00005).
    8. Establish an objective/correct baseline. If you have very little data, particularly when you work on a multimodal dataset but use only the text modality, with 7 different classes, it is probable that you will not reach a very high accuracy.
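
    Putting recommendations 1-7 together, a minimal sketch of the suggested setup might look like this (it reuses WDatasetX, y7, and embedding_layer from the question; the unit counts and learning rate are illustrative, and y7 is assumed to be a NumPy one-hot array):

    from sklearn.model_selection import train_test_split
    from tensorflow import keras
    from tensorflow.keras import layers

    # 6) Stratified split: every class lands in both sets, in proportion.
    X_train, X_test, y_train, y_test = train_test_split(
        WDatasetX, y7, test_size=0.20, random_state=42,
        stratify=y7.argmax(axis=1))  # class indices recovered from one-hot rows

    # 3)-5) A deliberately small model: a single bidirectional GRU layer.
    model = keras.Sequential([
        embedding_layer,                        # frozen GloVe embeddings, as in the question
        layers.Bidirectional(layers.GRU(64)),   # 64 units; 128 is a reasonable alternative
        layers.Dropout(0.2),                    # add only if overfitting persists
        layers.Dense(7, activation='softmax'),  # 2) mutually exclusive labels -> softmax
    ])

    # 1) + 7) categorical cross-entropy with a small learning rate.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])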

    Bear in mind that, in order to get a reasonable final result in your case, you need to take a data-centric approach rather than a model-centric one. Regardless of any model improvements, if the data is scarce and not comprehensive, you will not be able to achieve great results.