I am trying to encode a list of sentences into a list of embeddings. When I use a model that is supported by sentence-transformers, it works as expected. But when I use a model that is not, in this case Facebook's M2M100 model, I do not get the expected results.
When I load a model with SentenceTransformer(), my results look like this:
from sentence_transformers import SentenceTransformer
dat = ['Meteorite fell on the road ', 'I went in the wrong direction']
model_1 = SentenceTransformer('all-distilroberta-v1')
embeddings_1 = model_1.encode(dat)
embeddings_1.shape
> (2, 768)
However, when I use the M2M100 model, my results do not look right at all; specifically, I would expect 2 rows of results:
from transformers import M2M100Tokenizer
model_m2m = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model_m2m.src_lang = "en"
embeddings_m2m = model_m2m.encode(dat, return_tensors="pt")
embeddings_m2m.shape
> torch.Size([1, 4])
How should I write this so that it returns an array of embeddings, where each row corresponds to a sentence and the number of columns equals the dimensionality of the embedding?
(As a note, eventually I will be doing this for sentences in other languages, which is why I'm using a multi-lingual model.)
The code you provided only uses the tokenizer of the model, which maps the text to integer ids that don't carry any semantic meaning on their own.
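You can see this by inspecting what the tokenizer returns on its own (a minimal sketch; the exact ids depend on the vocabulary, the point is that they are plain integers, not vectors):
from transformers import M2M100Tokenizer

# The tokenizer alone maps text to vocabulary indices; no model weights are involved yet.
tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tok.src_lang = "en"
print(tok("I went in the wrong direction")["input_ids"])  # a short list of integer token ids, not an embedding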
To retrieve sentence embeddings (i.e. a vector that represents the text) from facebook/m2m100_418M, which is an encoder-decoder model, you need to perform some kind of pooling over the last hidden state of the encoder. Two common approaches, CLS pooling and mean pooling, are shown in the example below:
import torch
from transformers import M2M100Tokenizer, M2M100Model

def mean_pooling(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding tokens via the attention mask.
    non_pad_tokens = attention_mask.sum(1)
    sum_embeddings = torch.sum(attention_mask.unsqueeze(-1) * last_hidden_state, 1)
    return sum_embeddings / non_pad_tokens.unsqueeze(-1)

def cls_pooling(last_hidden_state):
    # Use the first token's embedding as the sentence representation.
    return last_hidden_state[:, 0]

dat = ['Meteorite fell on the road ', 'I went in the wrong direction']

model_id = "facebook/m2m100_418M"
t_m2m = M2M100Tokenizer.from_pretrained(model_id)
t_m2m.src_lang = "en"
m_m2m = M2M100Model.from_pretrained(model_id)

# Tokenize both sentences to the same padded length and run only the encoder.
tokenized = t_m2m(dat, padding=True, return_tensors='pt')
with torch.inference_mode():
    encoder_o = m_m2m.encoder(**tokenized)

# Shape: (batch_size, sequence_length, hidden_size)
encoder_last_hidden_state = encoder_o.last_hidden_state
print(encoder_last_hidden_state.shape)

# Both pooling strategies reduce the token dimension to one vector per sentence.
mean_pooling_embeddings = mean_pooling(encoder_last_hidden_state, tokenized.attention_mask)
print(mean_pooling_embeddings.shape)
cls_pooling_embeddings = cls_pooling(encoder_last_hidden_state)
print(cls_pooling_embeddings.shape)
Output:
torch.Size([2, 9, 1024])
torch.Size([2, 1024])
torch.Size([2, 1024])
Which of the two approaches works better for your downstream task must be tested with your data. Please also note that having sentence embeddings does not mean they are semantically meaningful; they might still be useless for your downstream task. Refer to this StackOverflow answer for further explanation.
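As a quick way to compare the pooled embeddings, you can compute their cosine similarity (a minimal sketch that assumes the mean_pooling_embeddings tensor from the example above is still in scope):
import torch.nn.functional as F

# L2-normalize the sentence embeddings; the dot product of unit vectors is the cosine similarity.
normalized = F.normalize(mean_pooling_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)  # 2x2 matrix, diagonal entries are 1.0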