Tags: pytorch, cosine-similarity

Why does torch cosine similarity between exactly the same vectors give a similarity of zero instead of one?


I have two tensors and I want to calculate the cosine similarity between them in PyTorch:

import torch

a = torch.tensor([[0.,0.,0.,0.,0.]])
b = torch.tensor([[0.,0.,0.,0.,0.]])

I calculate the cosine similarity matrix using the following function:

import torch
import torch.nn.functional as F

def calc_similarity_batch(a, b):
    # Stack both batches, then compute the pairwise cosine-similarity
    # matrix via broadcasting over the two unsqueezed dimensions
    representations = torch.cat([a, b], dim=0)
    return F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)

To my surprise the similarity matrix calculated by cosine_similarity function is:

tensor([[0., 0.],
        [0., 0.]])

While it should have been:

tensor([[1., 1.],
        [1., 1.]])

since the vectors are the same. Could someone explain what is wrong with my code?


Solution

  • You are correct that the cosine similarity between any two equal vectors should be one... except for a zero-length vector. In that case the denominator of the formula is zero, so the result is mathematically undefined. The implementation you are using handles this case by returning a similarity of 0: PyTorch's F.cosine_similarity clamps the denominator with a small eps (1e-8 by default) to avoid the division by zero, and since the numerator is also zero, the quotient comes out as 0.

    The formula makes this clear; the denominator is zero in your case:

    cos_sim(a, b) = (a · b) / (||a|| * ||b||)
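    A minimal sketch of the denominator for your inputs (the variable names are just illustrative):

    ```python
    import torch

    a = torch.tensor([[0., 0., 0., 0., 0.]])
    b = torch.tensor([[0., 0., 0., 0., 0.]])

    # The denominator of the cosine-similarity formula is the product
    # of the two vector norms, which is zero for zero-length vectors.
    denominator = a.norm(dim=1) * b.norm(dim=1)
    print(denominator)  # tensor([0.]) -> division by zero, result undefined
    ```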

    Is zero a good choice for this "forbidden" case? I don't know. Since cosine similarity measures the angle between two vectors (not whether they are equal), neither -1 nor +1 seems like a good answer either; 0 may simply be the least misleading compromise. Also note that the denominator is zero whenever EITHER of the two vectors has zero length. Your case, where both are zero, is just a special instance of that scenario.