Tags: python, machine-learning, nlp, huggingface-transformers, transformer-model

How to calculate word and sentence embedding using GPT-2?


I'm working on a program that calculates word and sentence embeddings using GPT-2, specifically the GPT2Model class. For the word embeddings, I take the last hidden state outputs[0] (shape: batch size x seq len x hidden size) returned after forwarding the input_ids through GPT2Model. For the sentence embedding, I take the hidden state of the last token in the sequence. This is the code I have tried:

from transformers import GPT2Tokenizer, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
captions = ["example caption", "example bird", "the bird is yellow has red wings", "hi", "very good"]

encoded_captions = [tokenizer.encode(caption) for caption in captions]

# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]

# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)

outputs = model(input_ids)
word_embedding = outputs[0].contiguous()
sentence_embedding = word_embedding[ :, -1, : ].contiguous()

I'm not sure whether my calculations for the word and sentence embeddings are correct. Can anyone help me confirm this?


Solution

  • Here is your modified code to compute sentence and word embeddings:

    from transformers import GPT2Tokenizer, GPT2Model
    import torch
    
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2Model.from_pretrained('gpt2')
    captions = [
        "example caption",
        "example bird",
        "the bird is yellow has red wings",
        "hi",
        "very good"
    ]
    
    # Tokenize and pad sequences
    encoded_captions = tokenizer(
        captions,
        return_tensors='pt',
        padding=True,
        truncation=True
    )
    input_ids = encoded_captions['input_ids']
    
    # Forward pass to get embeddings
    with torch.no_grad():
        outputs = model(input_ids)
    
    # Extract embeddings
    word_embeddings = outputs.last_hidden_state
    
    # Mask to ignore padding tokens
    masked_word_embeddings = word_embeddings * encoded_captions.attention_mask.unsqueeze(-1).float()
    
    # Sum pooling considering only non-padding tokens
    sentence_embeddings = masked_word_embeddings.sum(dim=1)
    
    # Normalize by the count of non-padding tokens
    sentence_embeddings /= encoded_captions.attention_mask.sum(dim=1, keepdim=True).float()
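
    A quick shape check (illustrative print statements, not part of the modified script itself) should show one 768-dimensional vector per token and one per caption:

    print(word_embeddings.shape)      # torch.Size([5, 7, 768])
    print(sentence_embeddings.shape)  # torch.Size([5, 768])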
    

    Some relevant facts:

    1. As you said, the word embeddings are the last hidden state. If you print the output's shape you see 5 matrices (the number of sentences), each with 7 rows (the maximum number of tokens among the sentences) and 768 columns (the model's hidden size).
    word_embeddings.shape
    >> torch.Size([5, 7, 768])
    

    This means that some sentences have embeddings at positions where no real token exists (padding), so we need to mask the output to consider only the actual tokens.

    2. Masking consists of multiplying the word embeddings by zero (any sentinel value would do, but zero is the most common and useful choice, as it nulls the values) at the positions that correspond to non-existent (padding) tokens. The attention mask is crucial for handling variable-length sequences and ensuring that padding tokens do not contribute to the embeddings.
    print(masked_word_embeddings)
    >> tensor([[[-0.2835, -0.0469, -0.5029,  ..., -0.0525, -0.0089, -0.1395],
             [-0.2636, -0.1355, -0.4277,  ..., -0.3552,  0.0437, -0.2479],
             [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
             ...,
             [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
             [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
             [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000]],
    ...
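
    To see where those zero rows come from, you can print the attention mask the tokenizer produced alongside input_ids (an illustrative addition, reusing encoded_captions from the script above):

    print(encoded_captions.attention_mask)
    # each row holds 1 at real-token positions and 0 at padding positions;
    # e.g. the first caption ("example caption") yields [1, 1, 0, 0, 0, 0, 0]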
    
    3. Usually, sentence embeddings are computed as the sum, mean or max of the masked word embeddings; which one to use depends on your use case.
    • Mean is better suited to variable-length sequences:
    sentence_embeddings = masked_word_embeddings.mean(dim=1)
    
    • Sum is intended to emphasize the relevant parts:
    sentence_embeddings = masked_word_embeddings.sum(dim=1)
    

    Many techniques exist, and the right one depends on how the embeddings perform for your task. I would choose the method that maximizes the cosine similarity between vectors I consider similar for my task. For example, if sum pooling yields higher similarity than mean pooling, it may be more suitable.
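
    As a minimal sketch of that comparison (reusing word_embeddings, masked_word_embeddings and encoded_captions from the script above; the choice of the two bird captions is arbitrary), you could compute each pooling variant and compare cosine similarities. Note that a truly masked mean divides by the attention-mask count rather than the padded length, and a masked max needs the padding positions pushed to -inf first:

    import torch.nn.functional as F

    mask = encoded_captions.attention_mask.unsqueeze(-1)  # [5, 7, 1], 1 = real token, 0 = padding

    # Sum pooling over real tokens only (padding rows are already zeroed)
    sum_pooled = masked_word_embeddings.sum(dim=1)

    # Masked mean: divide by the number of real tokens, not by the padded length
    mean_pooled = sum_pooled / mask.sum(dim=1).float()

    # Masked max: set padding positions to -inf so they never win the max
    max_pooled = word_embeddings.masked_fill(mask == 0, float('-inf')).max(dim=1).values

    # Which pooling rates "example bird" and "the bird is yellow has red wings" as most similar?
    for name, emb in [('sum', sum_pooled), ('mean', mean_pooled), ('max', max_pooled)]:
        sim = F.cosine_similarity(emb[1], emb[2], dim=0)
        print(name, round(sim.item(), 3))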

    4. Additionally, I suggest normalizing the summed values by the number of tokens in the sentence, as the modified code above does. With that normalization, longer sentences tend to have lower vector values, which embeds information about sentence length into the embedding. It prevents a meaningless high similarity score between a 4-token sentence and a whole book.
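
    In the modified script, that normalization is exactly what the last two lines do: sum the masked word embeddings and divide by the per-sentence token count taken from the attention mask (restated here for clarity, not new functionality):

    token_counts = encoded_captions.attention_mask.sum(dim=1, keepdim=True).float()  # real tokens per caption
    sentence_embeddings = masked_word_embeddings.sum(dim=1) / token_counts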