python, nlp, pytorch, word-embedding

Sentence embedding using T5


I would like to use the state-of-the-art LM T5 to get a sentence embedding vector. I found this repository: https://github.com/UKPLab/sentence-transformers As far as I know, in BERT I should take the first token, the [CLS] token, as the sentence embedding. In this repository I see the same behaviour for the T5 model:

cls_tokens = output_tokens[:, 0, :]  # CLS token is first token

Is this behaviour correct? I took the encoder from T5 and encoded two phrases with it:

"I live in the kindergarden"
"Yes, I live in the kindergarden"

The cosine similarity between them was only 0.2420.
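For reference, here is roughly how I computed it (a sketch, assuming the t5-base checkpoint and the Hugging Face transformers encoder; the embed helper is my own name):

import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5EncoderModel.from_pretrained("t5-base")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # take the first token's hidden state, like the [CLS] token in BERT
    return output.last_hidden_state[:, 0, :]

a = embed("I live in the kindergarden")
b = embed("Yes, I live in the kindergarden")
print(torch.nn.functional.cosine_similarity(a, b).item())  # this is where the 0.2420 came from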

I just need to understand how sentence embeddings work: do I need to train the network on a similarity task to get correct results, or is the base pretrained language model enough?


Solution

  • In order to obtain a sentence embedding from T5, you need to take the last_hidden_state from the T5 encoder output:

    output = model.encoder(input_ids=s, attention_mask=attn, return_dict=True)
    pooled_sentence = output.last_hidden_state  # shape is [batch_size, seq_len, hidden_size]
    # pooled_sentence holds the embeddings of every token in the sentence;
    # sum/average them to get a single sentence vector
    pooled_sentence = torch.mean(pooled_sentence, dim=1)  # shape is [batch_size, hidden_size]
    

    You now have sentence embeddings from T5.
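
    Putting it together, a minimal end-to-end sketch (assuming the t5-base checkpoint from Hugging Face transformers; the sentence_embedding helper is just illustrative):

    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    def sentence_embedding(text):
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            output = model.encoder(input_ids=enc.input_ids,
                                   attention_mask=enc.attention_mask,
                                   return_dict=True)
        # mean-pool the token embeddings into a single sentence vector
        return torch.mean(output.last_hidden_state, dim=1)

    a = sentence_embedding("I live in the kindergarden")
    b = sentence_embedding("Yes, I live in the kindergarden")
    print(torch.nn.functional.cosine_similarity(a, b).item())

    Running this gives one vector per sentence that you can compare with cosine similarity.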