I am doing a speech emotion recognition machine training.
I wish to apply an attention layer to the model. The instruction page is hard to understand.
def bi_duo_LSTM_model(X_train, y_train, X_test,y_test,num_classes,batch_size=68,units=128, learning_rate=0.005, epochs=20, dropout=0.2, recurrent_dropout=0.2):
class myCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs={}):
if (logs.get('acc') > 0.95):
print("\nReached 99% accuracy so cancelling training!")
self.model.stop_training = True
callbacks = myCallback()
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout,return_sequences=True)))
model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout)))
# model.add(tf.keras.layers.Bidirectional(LSTM(32)))
model.add(Dense(num_classes, activation='softmax'))
adamopt = tf.keras.optimizers.Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
RMSopt = tf.keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-6)
SGDopt = tf.keras.optimizers.SGD(lr=learning_rate, momentum=0.9, decay=0.1, nesterov=False)
model.compile(loss='binary_crossentropy',
optimizer=adamopt,
metrics=['accuracy'])
history = model.fit(X_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(X_test, y_test),
verbose=1,
callbacks=[callbacks])
score, acc = model.evaluate(X_test, y_test,
batch_size=batch_size)
yhat = model.predict(X_test)
return history, yhat
How can I apply it to fit for my model?
And are use_scale
, causal
and dropout
all the arguments?
If there is a dropout
in attention layer
, how do we deal with it since we have dropout
in LSTM layer?
Attention can be interpreted as a soft vector retrieval.
You have some query vectors. For each query, you want to retrieve some
values, such that you compute a weighted of them,
where the weights are obtained by comparing a query with keys (the number of keys must the be same as the number of values and often they are the same vectors).
In sequence-to-sequence models, the query is the decoder state and keys and values are the decoder states.
In classification task, you do not have such an explicit query. The easiest way how to get around this is training a "universal" query that is used to collect relevant information from the hidden states (something similar to what was originally described in this paper).
If you approach the problem as sequence labeling, assigning a label not to an entire sequence, but to individual time steps, you might want to use a self-attentive layer instead.