Tags: python, pytorch, huggingface-transformers, bert-language-model

Different embeddings for same sentences with torch transformer


Hey all, and apologies in advance for what is probably a fairly basic question. I have a theory about what's causing the issue here, but it would be great to confirm with people who know more about this than I do.

I've been trying to run this Python code snippet in Google Colab. The snippet is meant to compute similarity between sentences. The code runs fine, but the embeddings and distances change every time I run it, which isn't ideal for my intended use case.

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

I think the issue must be model-specific, since I get a warning about newly initialized pooler weights, and pooler_output is ultimately what the code reads to compute the similarities:

Some weights of RobertaModel were not initialized from the model checkpoint at qiyuw/pcl-roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Switching to an alternative model which does not give this warning (for example, sentence-transformers/all-mpnet-base-v2) makes the outputs reproducible, so I think the randomly initialized weights mentioned in the warning are the cause. (A rough sketch of how I'd call that alternative model is included after the questions below.) So here are my questions:

  1. Can I make the output reproducible by initialising/seeding the model differently?
  2. If I can't make the outputs reproducible, is there a way in which I can improve the accuracy to reduce the amount of variation between runs?
  3. Is there a way to search Hugging Face models for ones whose pooler weights are properly initialised, so I can find a model which does suit my purposes?
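
For reference, this is roughly how I'd call the alternative model, following the mean-pooling approach its model card recommends (my own paraphrase of that recipe, not the exact code I ran):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state

# Mean pooling over tokens, weighted by the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
print(F.cosine_similarity(embeddings[0], embeddings[2], dim=0).item())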

Thanks in advance


Solution

  • You are correct: the layer weights bert.pooler.dense.bias and bert.pooler.dense.weight are initialized randomly. You could always initialize these layers the same way to get reproducible output, but I doubt the inference code you copied from their README is correct in the first place. As you already mentioned, the pooler weights are not part of the checkpoint, and the repo's model class also explicitly makes sure the pooling layer is not added:

    ...
    self.bert = BertModel(config, add_pooling_layer=False)
    ...
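
    If you want to mirror that when loading the checkpoint yourself, you can skip the pooler entirely; a minimal sketch, assuming from_pretrained forwards add_pooling_layer to the BertModel constructor (this should also make the "newly initialized" warning disappear):

    from transformers import BertModel

    # add_pooling_layer=False means no pooler module is created, so the model
    # contains no randomly initialized weights at all.
    model = BertModel.from_pretrained("qiyuw/pcl-bert-base-uncased", add_pooling_layer=False)
    print(model.pooler)  # None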
    

    According to the README, the evaluation script of the repo should be called with the following command:

    python evaluation.py --model_name_or_path qiyuw/pcl-bert-base-uncased --mode test --pooler cls_before_pooler
    

    When you look into what that does, your inference code for qiyuw/pcl-bert-base-uncased should read the [CLS] token embedding from last_hidden_state instead of pooler_output, like this:

    import torch
    from scipy.spatial.distance import cosine
    from transformers import AutoModel, AutoTokenizer
    
    # Import our models. The package will take care of downloading the models automatically
    tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
    model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")
    
    # Tokenize input texts
    texts = [
        "There's a kid on a skateboard.",
        "A kid is skateboarding.",
        "A kid is inside the house."
    ]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    
    # Get the embeddings
    with torch.inference_mode():
        embeddings = model(**inputs)
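        # take the [CLS] token (first position of last_hidden_state) instead of the randomly initialized pooler_output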
        embeddings = embeddings.last_hidden_state[:, 0]
    
    # Calculate cosine similarities
    # Cosine similarities are in [-1, 1]. Higher means more similar
    cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
    cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
    
    print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
    print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
    

    Output:

    Cosine similarity between "There's a kid on a skateboard." and "A kid is skateboarding." is: 0.941
    Cosine similarity between "There's a kid on a skateboard." and "A kid is inside the house." is: 0.779
    
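    Since the embeddings are already torch tensors, you can also compute the similarities directly in PyTorch instead of going through scipy; continuing from the snippet above:

    import torch.nn.functional as F

    # embeddings has shape (3, hidden_size) from the inference snippet above
    cosine_sim_0_1 = F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()
    cosine_sim_0_2 = F.cosine_similarity(embeddings[0], embeddings[2], dim=0).item()
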

    Can I make the output reproducible by initialising/seeding the model differently?

    Yes, you can. Use torch.manual_seed before loading the model, so the randomly initialized pooler weights come out the same every time:

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    model_random = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

    torch.manual_seed(42)
    model_reproducible1 = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

    torch.manual_seed(42)
    model_reproducible2 = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

    # Only the two models loaded after identical seeding end up with the same pooler weights
    print(torch.allclose(model_random.pooler.dense.weight, model_reproducible1.pooler.dense.weight))
    print(torch.allclose(model_random.pooler.dense.weight, model_reproducible2.pooler.dense.weight))
    print(torch.allclose(model_reproducible1.pooler.dense.weight, model_reproducible2.pooler.dense.weight))
    

    Output:

    False
    False
    True
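
    Note that once you read the [CLS] embedding from last_hidden_state as shown above, the output no longer depends on the pooler at all, so two independently loaded models agree even without any seeding. A quick sanity-check sketch:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
    inputs = tokenizer(["A kid is skateboarding."], return_tensors="pt")

    # Load the checkpoint twice with no seeding: the pooler weights differ,
    # but the encoder weights (and therefore the [CLS] embeddings) are identical.
    model_a = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")
    model_b = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

    with torch.inference_mode():
        cls_a = model_a(**inputs).last_hidden_state[:, 0]
        cls_b = model_b(**inputs).last_hidden_state[:, 0]

    print(torch.allclose(cls_a, cls_b))  # expected: True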