I was reading the well-known paper 'Attention Is All You Need'. Although I am clear on most of the major concepts, I got stuck on a few points.
Any help is appreciated
The Encoder passes its output to the Decoder. This encoder output is projected into the 'Key' and 'Value' matrices for the Decoder's second multi-head attention (encoder-decoder attention) module, while the 'Query' comes from the Decoder's own states.
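To make that concrete, here is a minimal sketch (not the paper's exact code, and single-head only) of encoder-decoder attention in PyTorch: the encoder output is projected into K and V, while the decoder's hidden states supply Q.

```python
import torch
import torch.nn.functional as F

d_model = 512  # model width, as in the paper

W_q = torch.nn.Linear(d_model, d_model, bias=False)  # acts on decoder states
W_k = torch.nn.Linear(d_model, d_model, bias=False)  # acts on encoder output
W_v = torch.nn.Linear(d_model, d_model, bias=False)  # acts on encoder output

def cross_attention(decoder_states, encoder_output):
    # decoder_states: (tgt_len, d_model), encoder_output: (src_len, d_model)
    Q = W_q(decoder_states)
    K = W_k(encoder_output)
    V = W_v(encoder_output)
    scores = Q @ K.T / (d_model ** 0.5)   # scaled dot-product
    weights = F.softmax(scores, dim=-1)   # attention over the source tokens
    return weights @ V                    # (tgt_len, d_model)
```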
Why do we need the shifted output at test time? It does not seem to be required: when testing, we predict from the first token onward, with the 'BOS' (Beginning Of Sequence) token serving as the past token, so the input is automatically left-shifted.
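The way I understand it, the shift matters only during training (teacher forcing). A hedged sketch, where the token ids (BOS=1, EOS=2) and the example sentence are made up for illustration:

```python
BOS, EOS = 1, 2
target = [57, 983, 12, EOS]           # gold output sentence (ids are illustrative)

decoder_input = [BOS] + target[:-1]   # shifted right: [BOS, 57, 983, 12]
labels        = target                # predict:       [57, 983, 12, EOS]

# At inference there is nothing to shift: generation simply starts from BOS
# and each predicted token becomes the next decoder input.
```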
Yes, we need to iterate over and over, predicting one token at a time. If the predicted token is 'EOS' (End Of Sequence), we stop.
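That loop would look roughly like the greedy-decoding sketch below. Here `model(src_ids, tgt_ids)` is a hypothetical callable returning logits of shape (tgt_len, vocab_size); it stands in for a full Transformer, and the BOS/EOS ids are assumptions as above.

```python
import torch

def greedy_decode(model, src_ids, bos_id=1, eos_id=2, max_len=50):
    tgt_ids = [bos_id]                      # start from BOS only
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor(tgt_ids))
        next_id = int(logits[-1].argmax())  # token predicted at the last position
        tgt_ids.append(next_id)
        if next_id == eos_id:               # stop once EOS is produced
            break
    return tgt_ids[1:]                      # drop the leading BOS
```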
This point isn't clear to me, but it looks as though the Decoder's multi-head attention isn't trained.