Tags: python, numpy, huggingface-transformers, valueerror

Computing the Cosine Similarity of Embeddings Generated by the Dolly Model on the Hugging Face Hub


In Python, I have a text query variable and a dataset structured as follows:

text = "hey how are you doing today love"
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

I am trying to use the following pipeline to calculate the cosine similarity between the Dolly embeddings of text and dataset:

# Import Pipeline

from transformers import pipeline
import torch
import accelerate
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


# Create Feature Extraction Object

feature_extraction = pipeline('feature-extraction',
                              model='databricks/dolly-v2-3b', 
                              torch_dtype=torch.bfloat16,
                              trust_remote_code=True, 
                              device_map="auto")


# Define Inputs

text = ["hey how are you doing today love"]
dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]

# Create Embeddings

text_embeddings = feature_extraction(text)[0]
dataset_embeddings = feature_extraction(dataset)

text_embeddings = np.array(text_embeddings)
dataset_embeddings = np.array(dataset_embeddings)

text_embeddings = normalize(text_embeddings, norm='l2')
dataset_embeddings = normalize(dataset_embeddings, norm='l2')

cosine_similarity = np.dot(text_embeddings, dataset_embeddings.T)
angular_distance = np.arccos(cosine_similarity) / np.pi

The L2 normalization fails, and if I comment it out, I run into the following error:

ValueError: shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)

I know the error has something to do with the mismatched shapes of text_embeddings and dataset_embeddings, but I am not sure how to resolve it.
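
As far as I can tell, the (1, 3) comes from NumPy turning the list of per-sentence embeddings into an object array, because the three sentences have different token counts. Here is a minimal reproduction with zero-filled stand-ins of the same shapes:

import numpy as np

text_emb = np.zeros((1, 7, 2560))            # 7 token vectors
ragged = [np.zeros((1, 7, 2560)).tolist(),   # 7 tokens
          np.zeros((1, 4, 2560)).tolist(),   # 4 tokens
          np.zeros((1, 4, 2560)).tolist()]
dataset_emb = np.array(ragged, dtype=object) # older NumPy builds this implicitly
print(dataset_emb.shape)  # (3, 1) - an object array, not (3, seq_len, 2560)

try:
    np.dot(text_emb, dataset_emb.T)  # .T makes it (1, 3)
except ValueError as e:
    print(e)  # shapes (1,7,2560) and (1,3) not aligned: 2560 (dim 2) != 1 (dim 0)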

Help!


Solution

  • There are a couple of things happening here:

    • dolly-v2-3b returns one embedding (also called a vector) per token, so the number of embeddings depends on the input: the model produces 7 vectors for the first sentence in dataset but 4 for each of the two subsequent sentences. Because these per-sentence counts differ, np.array(dataset_embeddings) falls back to a (3, 1) object array instead of a regular numeric 3-D array, which is where the (1, 3) in your error message comes from.
    • cosine similarity measures the similarity between two vectors. Your code tries to compare the multiple token vectors of one sentence with the multiple token vectors of another, which is not what the operation expects. Before computing similarities, the embeddings therefore need to be condensed into a single vector per sentence - the code below uses a technique called "vector averaging" (also known as mean pooling), which simply takes the element-wise average of the token vectors; a minimal shape-only sketch of this step follows the list.
    • np.average (which performs the vector averaging) has to be applied to each sentence in dataset individually, because the raw per-sentence embeddings cannot be stacked into one regular array, and sklearn's normalize likewise has to be called per sentence, since it only accepts 2-D input.
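
    A minimal, shape-only sketch of the pooling step, using random stand-in arrays with the shapes described above (7 and 4 token vectors, hidden size 2560); the values are arbitrary, only the shapes matter:

    import numpy as np

    rng = np.random.default_rng(0)
    sentence_a = rng.standard_normal((1, 7, 2560))  # 7 token vectors
    sentence_b = rng.standard_normal((1, 4, 2560))  # 4 token vectors

    # Mean pooling: collapse the token axis so each sentence is
    # represented by a single 2560-dimensional vector
    pooled_a = np.average(sentence_a, axis=1)
    pooled_b = np.average(sentence_b, axis=1)
    print(pooled_a.shape, pooled_b.shape)  # (1, 2560) (1, 2560)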

    The code below runs without error and returns a cosine similarity of 1 for the first comparison, where the sentence is compared with itself, as expected. Note that floating-point error pushes that value marginally above 1 (the printed 1.0000000000000007), which would put it outside the domain of np.arccos and produce nan for the angular distance; the code therefore clips the similarities to [-1, 1] before taking the arccosine, giving the expected angular distance of 0 for identical vectors.

    # Installations required in Google Colab
    # %pip install transformers
    # %pip install torch
    # %pip install accelerate
    
    from transformers import pipeline
    import torch
    import accelerate
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import normalize
    
    
    # Create Feature Extraction Object
    
    feature_extraction = pipeline('feature-extraction',
                                  model='databricks/dolly-v2-3b', 
                                  torch_dtype=torch.bfloat16,
                                  trust_remote_code=True, 
                                  device_map="auto")
    
    
    # Define Inputs
    
    text = ["hey how are you doing today love"]
    dataset = ["hey how are you doing today love", "I am doing great", "What about you?"]
    
    # Create Embeddings
    text_embeddings = feature_extraction(text)
    dataset_embeddings = feature_extraction(dataset)
    
    # Perform Vector Averaging (mean pooling over the token axis)
    text_embeddings_avg = np.average(text_embeddings[0], axis=1)
    dataset_embeddings_avg = np.array(
        [
            np.average(sentence_embedding, axis=1)
            for sentence_embedding
            in dataset_embeddings
        ]
    )
    print(text_embeddings_avg.shape)  # (1, 2560)
    print(dataset_embeddings_avg.shape)  # (3, 1, 2560)
    
    # Perform Normalization (sklearn's normalize expects 2-D input,
    # hence one call per sentence)
    text_embeddings_avg_norm = normalize(text_embeddings_avg, norm='l2')
    dataset_embeddings_avg_norm = np.array(
        [
            normalize(sentence_embedding, norm='l2')
            for sentence_embedding
            in dataset_embeddings_avg
        ]
    )
    print(text_embeddings_avg_norm.shape)  # (1, 2560)
    print(dataset_embeddings_avg_norm.shape)  # (3, 1, 2560)
    
    # Cosine Similarity
    # (named cosine_similarities so it does not shadow sklearn's
    # cosine_similarity imported above)
    cosine_similarities = np.array(
        [
            np.dot(text_embeddings_avg_norm, sentence_embedding.T)
            for sentence_embedding
            in dataset_embeddings_avg_norm
        ]
    )
    # Clip to [-1, 1] so floating-point error cannot push a value outside
    # the domain of arccos (which would produce nan)
    angular_distance = np.arccos(np.clip(cosine_similarities, -1.0, 1.0)) / np.pi
    print(cosine_similarities.tolist())  # [[[1.0000000000000007]], [[0.7818918337438344]], [[0.7921756683919716]]]
    print(angular_distance.tolist())  # [[[0.0]], [[0.21425490131377858]], [[0.2089483418862303]]]
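
    As a side note, sklearn's cosine_similarity (imported above but otherwise unused) can replace the manual normalize-and-dot steps, since it performs the L2 normalization internally. A short sketch, assuming the pooled arrays from the code above are still in scope:

    # cosine_similarity accepts 2-D arrays, so the (3, 1, 2560) pooled
    # dataset embeddings are squeezed down to (3, 2560) first
    similarities = cosine_similarity(
        text_embeddings_avg,                     # shape (1, 2560)
        dataset_embeddings_avg.squeeze(axis=1),  # shape (3, 2560)
    )
    print(similarities.shape)    # (1, 3)
    print(similarities.tolist()) # approximately the same three values as above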