tensorflow keras attention-model logits

Calculating attention scores in Bahdanau attention in TensorFlow using decoder hidden state and encoder output


This question relates to the neural machine translation shown here: Neural Machine Translation

self.W1 and self.W2 are initialized as dense layers of 10 units each, in lines 4 and 5 of the __init__ function of class BahdanauAttention.

In the attached code image, I am not sure I understand the feed-forward neural network set up in lines 17 and 18. So I broke the formula down into its parts; see lines 23 and 24.

query_with_time_axis is the input tensor to self.W1, and values is the input to self.W2. Each layer computes Z = WX + b, and the two Z's are added together. The tensors being added have shapes (64, 1, 10) and (64, 16, 10), so the first is broadcast along its second axis and the sum has shape (64, 16, 10). I am assuming that random weight initialization for both self.W1 and self.W2 is handled by Keras behind the scenes.
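The shape arithmetic above can be checked with a small NumPy sketch. This is not the tutorial's code: the dense helper is a hypothetical stand-in for a Keras Dense layer without activation, the hidden size of 32 is made up for illustration, and the weights are drawn from a plain normal distribution rather than Keras' default initializer.

```python
import numpy as np

rng = np.random.default_rng(0)

# batch=64, src_len=16 and units=10 come from the question; hidden=32 is assumed.
batch, src_len, hidden, units = 64, 16, 32, 10


def dense(x, w, b):
    """Affine map Z = XW + b on the last axis (hypothetical Dense stand-in)."""
    return x @ w + b


# Stand-ins for the decoder hidden state and the encoder outputs.
query = rng.normal(size=(batch, hidden))
values = rng.normal(size=(batch, src_len, hidden))

# Randomly initialised parameters playing the role of self.W1 and self.W2.
w1, b1 = rng.normal(size=(hidden, units)), np.zeros(units)
w2, b2 = rng.normal(size=(hidden, units)), np.zeros(units)

query_with_time_axis = query[:, np.newaxis, :]  # (64, 1, 32)
z1 = dense(query_with_time_axis, w1, b1)        # (64, 1, 10)
z2 = dense(values, w2, b2)                      # (64, 16, 10)

# Broadcasting repeats z1 along axis 1, once per encoder position.
summed = z1 + z2                                # (64, 16, 10)
print(z1.shape, z2.shape, summed.shape)
```

Every encoder position therefore gets the same projected query added to its own projected encoder output.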

Question:

After adding the Z's together, a non-linearity (tanh) is applied to produce an activation, and this activation is the input to the next layer, self.V, which has just one output and gives us the score.
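A rough NumPy sketch of this step, under the assumption that self.V behaves like a plain affine map with one output unit. The names v_kernel and v_bias are hypothetical, and the input is random data standing in for the sum of the two projections.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, src_len, units = 64, 16, 10

# Random stand-in for self.W1(query_with_time_axis) + self.W2(values).
summed = rng.normal(size=(batch, src_len, units))

# Hypothetical parameters playing the role of self.V (one output unit).
v_kernel = rng.normal(size=(units, 1))
v_bias = np.zeros(1)

activation = np.tanh(summed)            # (64, 16, 10), every entry in [-1, 1]
score = activation @ v_kernel + v_bias  # (64, 16, 1), no activation afterwards

print(score.shape)
print(float(score.min()), float(score.max()))
```

Because the last layer is linear, the scores are not confined to [-1, 1]; their range depends only on the weights of the final layer.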

For this last step, no activation function (tanh etc.) is applied to the result of self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values))); the single output of this last layer is used as is.

Is there a reason why an activation function was not used for this last step?

See lines 17, 18, 23 and 24.


Solution

  • The outputs of the attention network are the so-called attention energies, i.e., one scalar for each encoder output. These numbers are stacked into a vector, and this vector is normalized using softmax, yielding the attention distribution.

    So there is in fact a non-linearity applied in the next step: the softmax. If you applied an activation function before the softmax, you would only restrict the set of distributions that the softmax can produce. A tanh, for example, would confine every score to [-1, 1], so no attention weight could be more than e^2 times larger than any other.
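To make the point about the restricted set of distributions concrete, here is a small NumPy sketch. The score values are invented for illustration and the softmax helper is written by hand rather than taken from TensorFlow. It compares the softmax of raw scores with the softmax of the same scores squashed by tanh first.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


# Hypothetical raw scores for 16 encoder positions; position 5 stands out.
scores = np.zeros(16)
scores[5] = 8.0

raw_weights = softmax(scores)                 # nearly one-hot on position 5
squashed_weights = softmax(np.tanh(scores))   # scores forced into [-1, 1]

# With all scores in [-1, 1] and 16 positions, the largest possible weight
# occurs for one score of 1 and fifteen scores of -1:
# e / (e + 15 / e) = e^2 / (e^2 + 15).
upper_bound = np.e**2 / (np.e**2 + 15)

print(raw_weights[5], squashed_weights[5], upper_bound)
```

With unbounded scores the model can put almost all of its attention on a single position. With tanh in front of the softmax, no position can receive more than the bound computed above, however strongly the network prefers it.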