Tags: python, machine-learning, nlp, huggingface-transformers, transformer-model

How to calculate word and sentence embeddings using RoBERTa?


I'm trying to calculate word and sentence embeddings using RoBERTa. For the word embeddings, I extract the last hidden state outputs[0] from the RobertaModel class, but I'm not sure whether this is the correct way to calculate them.

As for sentence embeddings, I don't know how to calculate them at all. This is the code I have tried:

from transformers import RobertaModel, RobertaTokenizer
import torch

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
captions = ["example caption", "lorem ipsum", "this bird is yellow has red wings", "hi", "example"]

encoded_captions = [tokenizer.encode(caption) for caption in captions]

# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]

# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)

outputs = model(input_ids)
word_embedding = outputs[0].contiguous()
sentence_embedding = ?????

How can I calculate word and sentence embeddings using RoBERTa?


Solution

  • Warning: This answer only shows ways to retrieve word and sentence embeddings from a technical perspective, as requested by the OP in the comments. The respective embeddings will not be useful from a performance perspective, e.g. for calculating the similarity between two sentences or words. Compare this SO answer for further information.

    Word embeddings

    It is important to note that RoBERTa was trained with a byte-level BPE tokenizer. This is a so-called subword tokenizer, which means that a single word of your input string can be split into several tokens. Take, for example, your second caption, lorem ipsum:

    from transformers import RobertaModel, RobertaTokenizerFast
    import torch
    
    m = RobertaModel.from_pretrained('roberta-base')
    t = RobertaTokenizerFast.from_pretrained('roberta-base')
    captions = ["example caption", "lorem ipsum", "this bird is yellow has red wings", "hi", "example"]
    
    print(t(captions[1]).input_ids)
    

    Output:

    [0, 462, 43375, 1437, 7418, 783, 2]
    

    As you can see, the two words were mapped to 5 tokens (0 and 2 are the special tokens <s> and </s>). That means that to retrieve the actual word embeddings, and not just the token embeddings, you need to apply some kind of aggregation. A common approach is mean pooling (compare this SO answer). Using the model's respective fast tokenizer helps here, because it returns a BatchEncoding object that can be used to map the tokens back to their respective words:

    # no need to pad manually, the tokenizer can do that for you
    tokenized_captions = t(captions, return_tensors='pt', padding='longest')
    
    with torch.inference_mode():
      model_inference_output = m(**tokenized_captions)
      contextualized_token_embeddings = model_inference_output.last_hidden_state
    
    # properly padded: shape is (batch_size, padded_sequence_length, hidden_size)
    print(contextualized_token_embeddings.shape)
    
    def fetch_word_embeddings(idx, sentence, tokenized_captions, contextualized_token_embeddings):
      word_embeddings = {}
      # collect the unique word ids; each id refers to one word in the original sentence (special and padding tokens map to None)
      word_ids = {i for i in tokenized_captions[idx].word_ids if i is not None}
    
      for word_id in word_ids:
        token_start, token_end = tokenized_captions[idx].word_to_tokens(word_id)
        word_start, word_end =  tokenized_captions[idx].word_to_chars(word_id)
    
        word=sentence[word_start:word_end]
        word_embeddings[word] = contextualized_token_embeddings[idx][token_start:token_end].mean(dim=0)
      
      return word_embeddings
    
    result = []
    for idx, sentence in enumerate(captions):
      word_embeddings = fetch_word_embeddings(idx, sentence, tokenized_captions, contextualized_token_embeddings)
      result.append({"sentence": sentence, "word_embeddings":word_embeddings})
    
    # contextualized word embedding of the word `ipsum` of the second caption   
    print(result[1]['word_embeddings']['ipsum'].shape)
    

    Output:

    torch.Size([5, 9, 768])
    torch.Size([768])
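
    To make the token-to-word mapping more tangible, you can also inspect the tokens and word ids of an encoding directly. This is just an illustrative check reusing tokenized_captions from above; special and padding tokens map to None:

    # tokens of the second caption and the word id each token belongs to
    print(tokenized_captions[1].tokens)
    print(tokenized_captions[1].word_ids)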
    

    Sentence embeddings

    Sentence embeddings represent the whole sentence as a single vector. There are different strategies to retrieve them. Commonly used are mean- and cls-pooling, with mean-pooling delivering better results, as shown in section 6 of this paper. The "only" challenge from a technical perspective (compare the warning preamble) is that you want to exclude the padding tokens:

    # has 1 for non-padding tokens and 0 for padding tokens
    attention_mask = tokenized_captions.attention_mask.unsqueeze(-1)
    
    # multiply the contextualized embeddings with the attention mask to
    # zero out the embeddings of the padding tokens
    sum_embeddings = torch.sum(contextualized_token_embeddings * attention_mask, 1)
    print(sum_embeddings.shape)
    num_non_padding_tokens = attention_mask.sum(1)
    print(num_non_padding_tokens)
    sentence_embeddings = sum_embeddings / num_non_padding_tokens
    print(sentence_embeddings.shape)
    

    Output:

    torch.Size([5, 768])
    tensor([[4],
            [7],
            [9],
            [3],
            [3]])
    torch.Size([5, 768])
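
    If you want to verify that masking out the padding actually matters, here is a small sketch reusing the tensors from above: a naive mean over all token positions only matches the masked mean for the longest caption, which contains no padding:

    naive_sentence_embeddings = contextualized_token_embeddings.mean(dim=1)

    # caption 2 is the longest (9 tokens, no padding) -> both means agree (True)
    print(torch.allclose(naive_sentence_embeddings[2], sentence_embeddings[2], atol=1e-6))
    # caption 0 is padded -> the naive mean also averages over padding embeddings (False)
    print(torch.allclose(naive_sentence_embeddings[0], sentence_embeddings[0], atol=1e-6))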
    

    You also asked in the comments whether you can use the pooler_output of roberta-base directly to retrieve the sentence embeddings. Yes, you can do that. The pooler_output is retrieved via a form of cls-pooling (code).
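
    For illustration, a minimal sketch reusing model_inference_output from the word-embeddings section above:

    # pooler_output: the first (<s>) token embedding passed through a dense layer + tanh
    cls_pooled_sentence_embeddings = model_inference_output.pooler_output
    print(cls_pooled_sentence_embeddings.shape)  # torch.Size([5, 768])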

    Please note, in addition to the warning preamble, that the layers used to generate the pooler_output are randomly initialized (i.e. untrained) for the roberta-base weights you load. That means the resulting embeddings are even less meaningful! This is also why loading the model prints:

    Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']