Tags: python, numpy, huggingface-transformers, valueerror

Computing the Cosine Similarity of Embeddings Generated by the Dolly Model on the Hugging Face Hub


In Python, I have a text query variable and a dataset structured as follows:

text = "hey how are you doing today love"
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

I am trying to use the following pipeline to calculate the cosine similarity between the Dolly embeddings of text and dataset:

# Import Pipeline

from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


# Create Feature Extraction Object

feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b', 
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True, 
                              device_map="auto")


# Define Inputs

text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

# Create Embeddings

text_embeddings = feature_extraction(text)[0]
dataset_embeddings = feature_extraction(dataset)

text_embeddings = np.array(text_embeddings)
dataset_embeddings = np.array(dataset_embeddings)

text_embeddings = normalize(text_embeddings, norm='l2')
dataset_embeddings = normalize(dataset_embeddings, norm='l2')

cosine_similarity = np.dot(text_embeddings, dataset_embeddings.T)
angular_distance = np.arccos(cosine_similarity) / np.pi

The L2 normalization fails, and if I comment it out, I run into the following error:

ValueError: shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)

I know the error has something to do with the mismatched shapes of text_embeddings and dataset_embeddings, but I am not sure how to resolve it.
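
As far as I can tell, the (1, 3) comes from NumPy turning the list of per-sentence embeddings into an object array, because the three sentences have different token counts. Here is a minimal reproduction with zero-filled stand-ins of the same shapes:

import numpy as np

text_emb = np.zeros((1, 7, 2560))            # 7 token vectors
ragged = [np.zeros((1, 7, 2560)).tolist(),   # 7 tokens
          np.zeros((1, 4, 2560)).tolist(),   # 4 tokens
          np.zeros((1, 4, 2560)).tolist()]
dataset_emb = np.array(ragged, dtype=object) # older NumPy builds this implicitly
print(dataset_emb.shape)  # (3, 1) - an object array, not (3, seq_len, 2560)

try:
    np.dot(text_emb, dataset_emb.T)  # .T makes it (1, 3)
except ValueError as e:
    print(e)  # shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)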

Help!


Solution

  • There are a couple of things happening here:

    • dolly-v2-3b returns one embedding (also called a vector) per token, so the number of embeddings depends on the input: the model produces 7 vectors for the first sentence in dataset but 4 for each of the two subsequent sentences. Because these per-sentence counts differ, np.array(dataset_embeddings) falls back to a (3, 1) object array instead of a regular numeric 3-D array, which is where the (1, 3) in your error message comes from.
    • cosine similarity measures the similarity between two vectors. Your code tries to compare the multiple token vectors of one sentence with the multiple token vectors of another, which is not what the operation expects. Before computing similarities, the embeddings therefore need to be condensed into a single vector per sentence - the code below uses a technique called "vector averaging" (also known as mean pooling), which simply takes the element-wise average of the token vectors; a minimal shape-only sketch of this step follows the list.
    • np.average (which performs the vector averaging) has to be applied to each sentence in dataset individually, because the raw per-sentence embeddings cannot be stacked into one regular array, and sklearn's normalize likewise has to be called per sentence, since it only accepts 2-D input.
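
    A minimal, shape-only sketch of the pooling step, using random stand-in arrays with the shapes described above (7 and 4 token vectors, hidden size 2560); the values are arbitrary, only the shapes matter:

    import numpy as np

    rng = np.random.default_rng(0)
    sentence_a = rng.standard_normal((1, 7, 2560))  # 7 token vectors
    sentence_b = rng.standard_normal((1, 4, 2560))  # 4 token vectors

    # Mean pooling: collapse the token axis so each sentence is
    # represented by a single 2560-dimensional vector
    pooled_a = np.average(sentence_a, axis=1)
    pooled_b = np.average(sentence_b, axis=1)
    print(pooled_a.shape, pooled_b.shape)  # (1, 2560) (1, 2560)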

    The code below runs without error and returns a cosine similarity of 1 for the first comparison, where the sentence is compared with itself, as expected. Note that floating-point error pushes that value marginally above 1 (the printed 1.0000000000000007), which would put it outside the domain of np.arccos and produce nan for the angular distance; the code therefore clips the similarities to [-1, 1] before taking the arccosine, giving the expected angular distance of 0 for identical vectors.

    # Installations required in Google Colab
    # %pip install transformers
    # %pip install torch
    # %pip install accelerate
    
    from transformers import pipeline
    import torch
    import accelerate
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import normalize
    
    
    # Create Feature Extraction Object
    
    feature_extraction = pipeline('feature-extraction',
                                  model='databricks/dolly-v2-3b', 
                                  torch_dtype=torch.bfloat16,
                                  trust_remote_code=True, 
                                  device_map="auto")
    
    
    # Define Inputs
    
    text = ["hey how are you doing today love"]
    dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
    
    # Create Embeddings
    text_embeddings = feature_extraction(text)
    dataset_embeddings = feature_extraction(dataset)
    
    # Perform Vector Averaging (mean pooling over the token axis)
    text_embeddings_avg = np.average(text_embeddings[0], axis=1)
    dataset_embeddings_avg = np.array(
        [
            np.average(sentence_embedding, axis=1)
            for sentence_embedding
            in dataset_embeddings
        ]
    )
    print(text_embeddings_avg.shape)  # (1, 2560)
    print(dataset_embeddings_avg.shape)  # (3, 1, 2560)
    
    # Perform Normalization (sklearn's normalize expects 2-D input,
    # hence one call per sentence)
    text_embeddings_avg_norm = normalize(text_embeddings_avg, norm='l2')
    dataset_embeddings_avg_norm = np.array(
        [
            normalize(sentence_embedding, norm='l2')
            for sentence_embedding
            in dataset_embeddings_avg
        ]
    )
    print(text_embeddings_avg_norm.shape)  # (1, 2560)
    print(dataset_embeddings_avg_norm.shape)  # (3, 1, 2560)
    
    # Cosine Similarity
    # (named cosine_similarities so it does not shadow sklearn's
    # cosine_similarity imported above)
    cosine_similarities = np.array(
        [
            np.dot(text_embeddings_avg_norm, sentence_embedding.T)
            for sentence_embedding
            in dataset_embeddings_avg_norm
        ]
    )
    # Clip to [-1, 1] so floating-point error cannot push a value outside
    # the domain of arccos (which would produce nan)
    angular_distance = np.arccos(np.clip(cosine_similarities, -1.0, 1.0)) / np.pi
    print(cosine_similarities.tolist())  # [[[1.0000000000000007]], [[0.7818918337438344]], [[0.7921756683919716]]]
    print(angular_distance.tolist())  # [[[0.0]], [[0.21425490131377858]], [[0.2089483418862303]]]
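
    As a side note, sklearn's cosine_similarity (imported above but otherwise unused) can replace the manual normalize-and-dot steps, since it performs the L2 normalization internally. A short sketch, assuming the pooled arrays from the code above are still in scope:

    # cosine_similarity accepts 2-D arrays, so the (3, 1, 2560) pooled
    # dataset embeddings are squeezed down to (3, 2560) first
    similarities = cosine_similarity(
        text_embeddings_avg,                     # shape (1, 2560)
        dataset_embeddings_avg.squeeze(axis=1),  # shape (3, 2560)
    )
    print(similarities.shape)    # (1, 3)
    print(similarities.tolist()) # approximately the same three values as above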