machine-learning, nlp, artificial-intelligence, huggingface-transformers, attention-model

How does the Encoder pass the Attention Matrix to the Decoder in Transformers ('Attention is all you need')?


I was reading the renowned paper 'Attention is all you need'. Though I am clear on most of the major concepts, I got stuck on a few points:

  1. How does the Encoder pass the attention matrix calculated from the input to the Decoder? As I understand it, it only passes the Key & Value matrices to the decoder.
  2. Where do we get the shifted output for the decoder from at test time?
  3. As it can output only one token at a time, is this transformer run for multiple iterations to generate the output sequence? If yes, how do we know when to stop?
  4. Are the weights of the Multi-Head Attention in the decoder trained, given that it already gets Q, K & V from the encoder & the masked multi-head attention?

Any help is appreciated


Solution

    1. The Encoder passes its final output, i.e. the encoded representation of the input sequence, not an attention matrix as such. Inside the Decoder's Multi-Head Attention (cross-attention) module, this encoder output is projected into the 'Key' & 'Value' matrices, while the 'Query' comes from the decoder's own masked multi-head attention output; see the cross-attention sketch after this list

    2. Why would we need a shifted output for testing? We don't: at test time we start from the 'BOS' (Beginning Of Sequence) token alone and feed every predicted token back in as decoder input, so the required shift happens automatically

    3. Yes, we iterate, predicting one token at a time and appending it to the decoder input. We stop when the predicted token is 'EOS' (End Of Sequence), or when a maximum length is reached; see the decoding loop sketch after this list

    4. Yes, they are trained. The decoder's multi-head (cross-)attention has its own learned projection matrices (W_Q, W_K, W_V and the output projection W_O); the encoder supplies only the hidden states from which K & V are projected, not the projection weights themselves
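
To make answers 1 and 4 concrete, here is a minimal single-head cross-attention sketch in PyTorch. The class name, shapes, and single-head simplification are mine, not the paper's exact multi-head formulation; the point is that the encoder contributes only its output states, while all four projections are trained decoder parameters:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head decoder cross-attention (illustrative sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        # All four projections are trained parameters of the decoder.
        self.w_q = nn.Linear(d_model, d_model)  # decoder states -> Queries
        self.w_k = nn.Linear(d_model, d_model)  # encoder output -> Keys
        self.w_v = nn.Linear(d_model, d_model)  # encoder output -> Values
        self.w_o = nn.Linear(d_model, d_model)  # output projection

    def forward(self, decoder_states, encoder_output):
        # decoder_states: (batch, tgt_len, d_model), from masked self-attention
        # encoder_output: (batch, src_len, d_model), the only thing the
        # encoder hands over; K and V are computed from it right here.
        q = self.w_q(decoder_states)
        k = self.w_k(encoder_output)
        v = self.w_v(encoder_output)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)  # the attention matrix itself
        return self.w_o(attn @ v)
```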
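
And a sketch of the test-time loop behind answers 2 and 3, assuming a hypothetical `model` exposing `encode`/`decode` methods and assumed `bos_id`/`eos_id` token ids (batch size 1, greedy decoding):

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    # Encode the source once; the encoder output is reused at every step.
    memory = model.encode(src_ids)               # (1, src_len, d_model)
    ys = torch.tensor([[bos_id]])                # start with BOS only
    for _ in range(max_len):
        logits = model.decode(ys, memory)        # (1, cur_len, vocab_size)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                    # stop once EOS is produced
            break
    return ys
```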