tensorflow machine-learning keras deep-learning lstm

How to mask paddings in LSTM model for speech emotion recognition

Given a few directories of .wav audio files, I have extracted their features in terms of a 3D array (batch, step, features).

For my case, the training dataset is (1883,100,136). Basically, each audio has been analyzed 100 times (imagine that as 1fps) and each time, 136 features have been extracted. However, those audio files are different in length so some of them cannot be analyzed for 100 times.

For instance, one of the audio has 50 sets of 136 features as effective values so the rest 50 sets were padded with zeros.

Here is my model.

def LSTM_model_building(units=200,learning_rate=0.005,epochs=20,dropout=0.19,recurrent_dropout=0.2):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout, input_shape=(X_train.shape[0],100, 136))))
#     model.add(tf.keras.layers.Bidirectional(LSTM(32)))
    model.add(Dense(num_classes, activation='softmax'))
    
    adamopt = tf.keras.optimizers.Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    opt = tf.keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-6)
#     opt = tf.keras.optimizers.SGD(lr=learning_rate, momentum=0.9, decay=0., nesterov=False)
    
    model.compile(loss='categorical_crossentropy',
                  optimizer=adamopt,
                  metrics=['accuracy'])

    history = model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_test, y_test),
                        verbose = 1)

    score, acc = model.evaluate(X_test, y_test,
                                batch_size=batch_size)
    return history

I wish to mask the padding however the instruction, shown on the Keras website, uses an embedding layer which I believe is usually used for NLP. I have no idea how to use the embedding layer for my model.

Can anyone teach me how to apply masking for my LSTM model?

Solution

Embedding layer is not for your case. You can consider instead Masking layer. It is simply integrable in your model structure, as shown below.

I also remember you that the input shape must be specified in the first layer of a sequential model. Remember also that you don't need to pass the sample dimension. In your case, the input shape is (100,136) which is equal to (timesteps,n_features)

units,learning_rate,dropout,recurrent_dropout = 200,0.005,0.19,0.2
num_classes = 3

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(100,136)))
model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout)))
model.add(Dense(num_classes, activation='softmax'))

adamopt = tf.keras.optimizers.Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
opt = tf.keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-6)

model.compile(loss='categorical_crossentropy',
              optimizer=adamopt,
              metrics=['accuracy'])

model.summary()