machine-learning, pytorch, huggingface-transformers, language-model

How to get the vector embedding of a token in GPT?


I have a GPT model:

from transformers import BioGptForCausalLM

model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device)

When I send my batch to it, I can get the logits and the hidden states:

out = model(batch["input_ids"].to(device), output_hidden_states=True, return_dict=True)
print(out.keys())
>>> odict_keys(['logits', 'past_key_values', 'hidden_states'])
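
For context, a minimal tokenization sketch that would produce a batch with the shapes reported below. This is an assumption: the post does not show how `batch` was built, and the padding settings and sentences are hypothetical.

from transformers import BioGptTokenizer

# Hypothetical batch construction: pad/truncate every sequence to 1024 tokens
# so the shapes match the ones reported in this question.
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
sentences = ["first biomedical sentence", "second biomedical sentence"]
batch = tokenizer(sentences, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")
print(batch["input_ids"].shape)
>>> torch.Size([2, 1024])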

The logits have a shape of

torch.Size([2, 1024, 42386])

Corresponding to (batch, seq_length, vocab_length)

How can I get the vector embedding of the first (i.e., dim=0) token in the last layer (i.e., after the fully connected layer)? I believe it should be of size [2, 1024, 1024]

From here it seems like it should be under last_hidden_state, but I can't seem to generate it. out.hidden_states seems to be a tuple of length 25, where each is of dimension [2, 1024, 1024]. I'm wondering if the last one is the one I'm looking for, but I'm not sure.


Solution

  • You are right to use output_hidden_states=True and to inspect out.hidden_states. This element is a tuple of length 25, as you mentioned. According to the BioGPT paper and the Hugging Face documentation, your model contains 24 transformer layers, so the 25 elements in the tuple are the output of the initial embedding layer followed by the outputs of each of the 24 layers.

    The shape of each of these tensors is [B, L, E], where B is your batch size, L is the length of the input, and E is the dimension of the embedding. Judging by the shapes you indicated, you are padding your input to 1024 tokens. So the representation of your first token (in the first batched sentence) is out.hidden_states[k][0, 0, :], which has shape [1024]. Here, k denotes the layer you want to use; which one is appropriate depends on what you will do with the representation. For the last layer, use k = -1 (see the sketch below).
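
    Putting that together, here is a short sketch reusing `out` from the question (it assumes the forward pass was run with output_hidden_states=True as shown above; the variable names are just for illustration):

    print(len(out.hidden_states))
    >>> 25

    # Last transformer layer: shape [2, 1024, 1024] -> (batch, seq_length, embedding_dim)
    last_layer = out.hidden_states[-1]

    # Representation of the first token of every sequence in the batch: shape [2, 1024]
    first_token_all = last_layer[:, 0, :]

    # Representation of the first token of the first sequence only: shape [1024]
    first_token = out.hidden_states[-1][0, 0, :]
    print(first_token.shape)
    >>> torch.Size([1024])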