deep-learning pytorch gpt-2 text-generation

How do GPT-like transformers use only the decoder for sequence generation?


I want to code a GPT-like transformer for a specific text generation task. GPT-like models use only a stack of decoder blocks [1]. I know how to code all the sub-modules of the decoder block shown below (from the embedding to the softmax layer) in PyTorch. However, I don't know what I should give as input. The figure says "Output shifted right".

[Figure: Transformer decoder block, with its input labeled "Output shifted right"]

For example, this is my data (where < and > are the sos and eos tokens):

  • < abcdefgh >

What should I give to my GPT-like model to train it properly?

Also, since I am not using an encoder, should I still give input to the multi-head attention block?

Sorry if my questions seem a little dumb, I am so new to transformers.


Solution

  • The input for a decoder-only model like GPT is typically a sequence of tokens, just like in an encoder-decoder model. However, the difference lies in how the input is processed.

    In an encoder-decoder Transformer, the input sequence is first processed by an encoder, which produces one contextual representation per input token. The decoder then attends to these encoder outputs through its encoder-decoder (cross-) attention sub-layer while generating the output sequence.

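    In contrast, a decoder-only model like GPT has no separate encoder. The input sequence is fed directly into the decoder stack, where masked (causal) self-attention lets each position attend only to itself and to earlier positions. Because there are no encoder outputs to attend to, the encoder-decoder attention sub-layer is left out of each block, so only the masked self-attention and feed-forward sub-layers remain.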

    In both cases, the input sequence is typically a sequence of tokens that represent the text data being processed. The tokens may be words, subwords, or characters, depending on the specific modeling approach and the granularity of the text data being processed.
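
To make the "Output shifted right" idea concrete for the sample `< abcdefgh >` from the question, here is a minimal sketch, not a reference GPT implementation. It assumes character-level tokens and next-token training: the model reads every token except the last and is trained to predict every token except the first. `TinyGPT` and all of its hyperparameters are hypothetical names chosen for illustration. The decoder block without cross-attention is approximated here with PyTorch's `nn.TransformerEncoderLayer` plus a causal mask, which is a convenient stand-in rather than the way any particular GPT is coded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Character-level vocabulary for the toy sample; "<" and ">" are the sos/eos tokens.
sample = "<abcdefgh>"
vocab = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in sample])

# "Output shifted right": the input drops the last token, the target drops the first.
inputs = ids[:-1].unsqueeze(0)   # "<abcdefgh", shape (1, 9)
targets = ids[1:].unsqueeze(0)   # "abcdefgh>", shape (1, 9)

# Causal mask: True marks positions that must NOT be attended to (future tokens).
seq_len = inputs.size(1)
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)


class TinyGPT(nn.Module):
    """Hypothetical decoder-only model: masked self-attention, no cross-attention."""

    def __init__(self, vocab_size, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, dropout=0.0, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask):
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)  # logits, shape (batch, seq_len, vocab_size)


model = TinyGPT(len(vocab))
logits = model(inputs, causal_mask)

# Next-token loss: the logits at position t are scored against the token at t + 1.
loss = F.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
```

In a training loop, `loss.backward()` and an optimizer step would follow. At generation time the same model would be fed `<` alone, and each predicted token would be appended to the input until `>` is produced.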