I'm working on a program that calculates word and sentence embeddings using GPT-2, specifically the GPT2Model class. For word embeddings, I take the last hidden state outputs[0] after forwarding the input_ids (shape: batch size x seq len) through GPT2Model. For the sentence embedding, I take the hidden state of the last token in the sequence. This is the code I have tried:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
captions = ["example caption", "example bird", "the bird is yellow has red wings", "hi", "very good"]
encoded_captions = [tokenizer.encode(caption) for caption in captions]
# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]
# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)
outputs = model(input_ids)
word_embedding = outputs[0].contiguous()
sentence_embedding = word_embedding[:, -1, :].contiguous()
I'm not sure whether my calculations for the word and sentence embeddings are correct. Can anyone help me confirm this?
Here is a modified version of your code that computes word and sentence embeddings:
from transformers import GPT2Tokenizer, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2Model.from_pretrained('gpt2')

captions = [
    "example caption",
    "example bird",
    "the bird is yellow has red wings",
    "hi",
    "very good"
]

# Tokenize and pad sequences
encoded_captions = tokenizer(
    captions,
    return_tensors='pt',
    padding=True,
    truncation=True
)
input_ids = encoded_captions['input_ids']

# Forward pass to get embeddings
with torch.no_grad():
    outputs = model(input_ids)

# Extract embeddings
word_embeddings = outputs.last_hidden_state

# Mask to ignore padding tokens
masked_word_embeddings = word_embeddings * encoded_captions.attention_mask.unsqueeze(-1).float()

# Sum pooling considering only non-padding tokens
sentence_embeddings = masked_word_embeddings.sum(dim=1)

# Normalize by the count of non-padding tokens (masked mean pooling)
sentence_embeddings /= encoded_captions.attention_mask.sum(dim=1, keepdim=True).float()
Some relevant facts:
word_embeddings.shape
>> torch.Size([5, 7, 768])
All five captions are padded to the same length (7 tokens), so shorter sentences have embedding rows at padding positions. We therefore mask the output so that only real (non-padding) tokens contribute:
print(masked_word_embeddings)
>> tensor([[[-0.2835, -0.0469, -0.5029, ..., -0.0525, -0.0089, -0.1395],
[-0.2636, -0.1355, -0.4277, ..., -0.3552, 0.0437, -0.2479],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
...,
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000],
[ 0.0000, -0.0000, 0.0000, ..., 0.0000, -0.0000, -0.0000]],
...
Other pooling strategies work too, for example mean or max pooling over the masked embeddings:
sentence_embeddings = masked_word_embeddings.mean(dim=1)        # note: divides by the padded length, unlike the masked mean above
sentence_embeddings = masked_word_embeddings.max(dim=1).values  # element-wise max over the sequence dimension
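If you want to stay closer to your original idea of using the hidden state at the end of each sequence, you can index the last non-padding token per sentence instead of position -1. A minimal sketch, assuming the right-padded input_ids and attention_mask produced by the tokenizer call above (right padding is the tokenizer default here):
# Index of the last real (non-padding) token in each sequence, shape: (batch,)
last_token_idx = encoded_captions.attention_mask.sum(dim=1) - 1
batch_idx = torch.arange(word_embeddings.size(0))
# Hidden state of the last real token per sentence, shape: (batch, 768)
last_token_embeddings = word_embeddings[batch_idx, last_token_idx]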
There are many possible techniques, and the right one depends on how the embeddings perform for your task. I would choose the method that maximizes the cosine similarity between vectors I consider similar for my task. For example, if sum pooling gives higher similarity than mean pooling, it may be more suitable.
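As a rough way to compare pooling strategies, you can check the cosine similarity between sentence embeddings of captions you consider similar. A small sketch using the variables from the code above; the choice of captions 0 and 1 as a "similar" pair is just an example:
import torch.nn.functional as F

# Two pooling strategies over the same masked word embeddings
mean_pooled = masked_word_embeddings.sum(dim=1) / encoded_captions.attention_mask.sum(dim=1, keepdim=True).float()
max_pooled = masked_word_embeddings.max(dim=1).values

# Cosine similarity between "example caption" (index 0) and "example bird" (index 1)
sim_mean = F.cosine_similarity(mean_pooled[0], mean_pooled[1], dim=0)
sim_max = F.cosine_similarity(max_pooled[0], max_pooled[1], dim=0)
print(sim_mean.item(), sim_max.item())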