machine-learning · neural-network · deep-learning · nlp · recurrent-neural-network

If a small neural network is used as the scoring function in an attention model, what label/value is it trained against?


I am reading up on the attention mechanism in encoder-decoder architectures for machine translation. Several scoring functions have been proposed for the decoding step, such as cosine similarity (between an encoder state and a decoder state), a simple dot product, and so on. One of them uses a neural network that is trained to produce the score. What I don't get is what we are training it against. By that I mean the output "Y" label/value. The equation for the network is given below.

score(s, h) = v · tanh(W[s; h])

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html


Solution

  • The neural network used for attention is not trained separately. Put simply, v · tanh(W[s; h]) (what the paper calls a neural net) is a feedforward layer that is trained jointly with the encoder and decoder.

    Any attention mechanism comes up with a weighting scheme to select and combine the appropriate encoder states for a particular decoding step. Suppose the encoder outputs are a1, a2, .., an. At every decoding step, a weighted combination of the encoder states is given to the decoder as input. The attention score yields the appropriate weights α1, α2, .., αn for that step. Hence, to produce a decoder output d1, the input would be a1·α1 + a2·α2 + .. + an·αn.

    The weights α1, .. are obtained by applying a softmax to the outputs of the attention layer/net, in your case the tanh. The parameters of that layer (W and v) are learned along the way, i.e., backpropagation and gradient updates for the tanh layer happen together with those for the entire encoder-decoder network. There is no separate "Y" label for the scores; the only training signal is the translation loss at the decoder's output, and the attention weights are shaped by whatever weighting helps reduce that loss.
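    A minimal NumPy sketch of the forward pass described above: the additive score v · tanh(W[s; h]) for each encoder state, a softmax to get α1, .., αn, and the weighted combination fed to the decoder. The dimensions, random initialization, and function names here are illustrative assumptions; in a real model W and v would be learned jointly with the encoder and decoder via backpropagation, not set randomly.

    ```python
    import numpy as np

    def softmax(x):
        # numerically stable softmax over a 1-D array of scores
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def additive_attention(encoder_states, decoder_state, W, v):
        """One attention step: score(s, h) = v . tanh(W[s; h]).

        encoder_states: (n, d_enc) array of a1, .., an
        decoder_state:  (d_dec,) current decoder state s
        W, v:           parameters of the attention layer (learned jointly
                        with the whole network during training; random here)
        """
        scores = np.array([
            v @ np.tanh(W @ np.concatenate([decoder_state, h]))
            for h in encoder_states
        ])
        alphas = softmax(scores)            # α1, .., αn, summing to 1
        context = alphas @ encoder_states   # a1*α1 + .. + an*αn
        return alphas, context

    # toy dimensions (hypothetical): 3 encoder states of size 4, decoder state of size 4
    rng = np.random.default_rng(0)
    enc = rng.normal(size=(3, 4))
    dec = rng.normal(size=4)
    W = rng.normal(size=(5, 8))   # attention hidden size 5, input size 4 + 4
    v = rng.normal(size=5)

    alphas, context = additive_attention(enc, dec, W, v)
    ```

    The `context` vector is what gets passed into the decoder step; because every operation above is differentiable, gradients of the translation loss flow back through `context` into W and v, which is why no separate label for the scores is needed.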