In Python, I have a text query variable and a dataset structured as follows:
text = "hey how are you doing today love"
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
I am trying to use the following pipeline to calculate the cosine similarity between the Dolly embeddings of text and dataset:
# Import Pipeline
from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
# Create Feature Extraction Object
feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b',
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True,
                              device_map="auto")
# Define Inputs
text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
# Create Embeddings
text_embeddings = feature_extraction(text)[0]
dataset_embeddings = feature_extraction(dataset)
text_embeddings = np.array(text_embeddings)
dataset_embeddings = np.array(dataset_embeddings)
text_embeddings = normalize(text_embeddings, norm='l2')
dataset_embeddings = normalize(dataset_embeddings, norm='l2')
cosine_similarity = np.dot(text_embeddings, dataset_embeddings.T)
angular_distance = np.arccos(cosine_similarity) / np.pi
The L2 normalization fails, and if I comment it out, I run into the following error:
ValueError: shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)
I know that the error has something to do with the mismatched shapes of text_embeddings and dataset_embeddings. However, I am not sure what I can do to resolve it.
Help!
There are a couple of things happening here:
1. dolly-v2-3b gives you multiple embeddings for a given text input, where the number of embeddings depends on the input you provide. For example, while the model provides 7 embeddings (also called vectors) for the first sentence in dataset, it provides 4 embeddings for each of the subsequent two.
2. Cosine similarity measures the similarity between two vectors. The code you provided tries to compare the multiple vectors of one sentence with the multiple vectors of another sentence, which is not an operation cosine similarity can perform. Therefore, before performing the similarity computations, we need to condense each sentence's embeddings into a single vector. The code below uses a technique called vector averaging, which simply computes the element-wise average of the vectors. Note that np.average (which performs the vector averaging) and sklearn's normalize have to be applied to each sentence in dataset individually, because the sentences yield different numbers of vectors; a minimal illustration of this pooling step follows.
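To make the pooling step concrete, here is a minimal sketch with small synthetic arrays standing in for the pipeline output (the real Dolly vectors are 2560-dimensional; the dimensions and random values below are purely illustrative):
import numpy as np
# Synthetic stand-ins for feature_extraction(dataset): one (1, n_tokens, dim)
# array per sentence, where n_tokens varies from sentence to sentence.
fake_outputs = [
    np.random.rand(1, 7, 4),  # 7 token vectors for the first sentence
    np.random.rand(1, 4, 4),  # 4 token vectors for the second sentence
    np.random.rand(1, 4, 4),  # 4 token vectors for the third sentence
]
# Averaging over the token axis condenses each sentence into a single vector.
pooled = np.array([np.average(out, axis=1) for out in fake_outputs])
print(pooled.shape)  # (3, 1, 4): one averaged vector per sentence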
The code below runs without error and returns a cosine similarity of 1 for the first comparison, where we compare the sentence to itself, which is expected. The angular distance for that comparison comes out as nan rather than 0 because floating-point error pushes the self-similarity slightly above 1 (to 1.0000000000000007), which is outside the domain of np.arccos; see the note after the code for a fix.
# Installations required in Google Colab
# %pip install transformers
# %pip install torch
# %pip install accelerate
from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
# Create Feature Extraction Object
feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b',
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True,
                              device_map="auto")
# Define Inputs
text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
# Create Embeddings
text_embeddings = feature_extraction(text)
dataset_embeddings = feature_extraction(dataset)
# Perform Vector Averaging (mean pooling over the token axis)
text_embeddings_avg = np.average(text_embeddings[0], axis=1)
dataset_embeddings_avg = np.array(
    [
        np.average(sentence_embedding, axis=1)
        for sentence_embedding
        in dataset_embeddings
    ]
)
print(text_embeddings_avg.shape) # (1, 2560)
print(dataset_embeddings_avg.shape) # (3, 1, 2560)
# Perform Normalization (unit length, so a dot product equals cosine similarity)
text_embeddings_avg_norm = normalize(text_embeddings_avg, norm='l2')
dataset_embeddings_avg_norm = np.array(
    [
        normalize(sentence_embedding, norm='l2')
        for sentence_embedding
        in dataset_embeddings_avg
    ]
)
print(text_embeddings_avg_norm.shape) # (1, 2560)
print(dataset_embeddings_avg_norm.shape) # (3, 1, 2560)
# Cosine Similarity (named cosine_similarities so it does not shadow the sklearn import)
cosine_similarities = np.array(
    [
        np.dot(text_embeddings_avg_norm, sentence_embedding.T)
        for sentence_embedding
        in dataset_embeddings_avg_norm
    ]
)
angular_distance = np.arccos(cosine_similarities) / np.pi
print(cosine_similarities.tolist())  # [[[1.0000000000000007]], [[0.7818918337438344]], [[0.7921756683919716]]]
print(angular_distance.tolist())  # [[[nan]], [[0.21425490131377858]], [[0.2089483418862303]]]
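The nan can be avoided by clipping the similarities into the valid arccos domain [-1, 1] before computing the angular distance; a minimal sketch, continuing from the variables above:
# Clip away the floating-point overshoot before taking the arccosine
clipped = np.clip(cosine_similarities, -1.0, 1.0)
angular_distance = np.arccos(clipped) / np.pi
print(angular_distance.tolist())  # the first comparison now yields 0.0 instead of nan
Alternatively, the cosine_similarity function already imported from sklearn can do the comparison in a single vectorized call; it L2-normalizes internally, so the averaged (unnormalized) embeddings can be passed directly. A sketch under the same assumptions:
# (1, 2560) vs. (3, 2560): squeeze the singleton axis out of the dataset averages
sims = cosine_similarity(text_embeddings_avg, dataset_embeddings_avg.squeeze(axis=1))
print(sims.shape)  # (1, 3): one similarity per dataset sentence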