machine-learning · neural-network · deep-learning · nlp · recurrent-neural-network

If a small neural network is used as the scoring function in an attention model, what label/value is it trained against?


I am reading up on the attention mechanism in encoder-decoder architectures for machine translation. Several scoring functions have been proposed for the decoding step, such as cosine similarity (between an encoder state and a decoder state), a simple dot product, and so on. One of them uses a neural network that is trained to produce the score. What I don't get is what we are training it against. By that I mean the output "Y" label/value. The equation for the network is given below.

score(s, h) = v · tanh(W[s; h])

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html


Solution

  • The neural network used for attention is not trained separately. Put simply, v · tanh(W[s; h]) (what the paper calls a neural net) is a feedforward layer that is trained jointly with the encoder and decoder.

    Any attention mechanism comes up with a weighting scheme to select and combine the appropriate encoder states for a particular decoding step. Suppose the encoder outputs are a1, a2, .., an. At every decoding step, a weighted combination of the encoder states is given to the decoder as input. The attention score yields the appropriate weights α1, α2, .., αn for that step. Hence, to produce a decoder output d1, the input would be a1·α1 + a2·α2 + .. + an·αn.

    The weights α1, .. are obtained by applying a softmax to the outputs of the attention layer/net, in your case the tanh. The parameters of that layer (W and v) are learned along the way, i.e., backpropagation and gradient updates for the tanh layer happen together with those for the entire encoder-decoder network. There is no separate "Y" label for the scores; the only training signal is the translation loss at the decoder's output, and the attention weights are shaped by whatever weighting helps reduce that loss.
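    A minimal NumPy sketch of the forward pass described above: the additive score v · tanh(W[s; h]) for each encoder state, a softmax to get α1, .., αn, and the weighted combination fed to the decoder. The dimensions, random initialization, and function names here are illustrative assumptions; in a real model W and v would be learned jointly with the encoder and decoder via backpropagation, not set randomly.

    ```python
    import numpy as np

    def softmax(x):
        # numerically stable softmax over a 1-D array of scores
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def additive_attention(encoder_states, decoder_state, W, v):
        """One attention step: score(s, h) = v . tanh(W[s; h]).

        encoder_states: (n, d_enc) array of a1, .., an
        decoder_state:  (d_dec,) current decoder state s
        W, v:           parameters of the attention layer (learned jointly
                        with the whole network during training; random here)
        """
        scores = np.array([
            v @ np.tanh(W @ np.concatenate([decoder_state, h]))
            for h in encoder_states
        ])
        alphas = softmax(scores)            # α1, .., αn, summing to 1
        context = alphas @ encoder_states   # a1*α1 + .. + an*αn
        return alphas, context

    # toy dimensions (hypothetical): 3 encoder states of size 4, decoder state of size 4
    rng = np.random.default_rng(0)
    enc = rng.normal(size=(3, 4))
    dec = rng.normal(size=4)
    W = rng.normal(size=(5, 8))   # attention hidden size 5, input size 4 + 4
    v = rng.normal(size=5)

    alphas, context = additive_attention(enc, dec, W, v)
    ```

    The `context` vector is what gets passed into the decoder step; because every operation above is differentiable, gradients of the translation loss flow back through `context` into W and v, which is why no separate label for the scores is needed.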