Tags: python, pytorch, embedding, cosine-similarity

Cosine similarity between two arrays for word embeddings


I am trying to find the most similar word embeddings between two arrays. The first array, A, has shape [100, 50], where 100 is the number of words and 50 is the embedding dimension. The second array, B, of shape [400000, 50], stores many other words. For each word in A, I want to find the top 10 words in B with the highest cosine similarity.

I solved it using two nested for loops, but this gets slow as the number of samples in A grows, so I would like to know whether there is a faster way. I am using the cosine_similarity function from torch; if there is a faster alternative, that would also be great. I have tried the solution posted here, but since it is 5 years old, I would like to know if there is a better one. Thanks in advance.


Solution

  • A friend shared a much faster solution; this is the code:

    import torch
    from torch import Tensor

    def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
        # x: [n, d], y: [m, d] -> pairwise cosine similarities [n, m]
        return torch.cosine_similarity(x[..., None], y.T, dim=-2)
    

    This way the output is the full [100, 400000] similarity matrix, and there is no need to loop through either dataset.
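
    To get the top 10 matches the question asks for, the similarity matrix can be fed to `torch.topk`. A minimal sketch, using random toy data in place of the real embeddings (B shrunk to 4000 rows so it runs quickly; the shapes are otherwise as in the question):

    ```python
    import torch
    from torch import Tensor

    def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
        # x: [n, d], y: [m, d] -> pairwise cosine similarities [n, m]
        # x[..., None] has shape [n, d, 1]; y.T has shape [d, m];
        # broadcasting gives [n, d, m], reduced over the embedding dim.
        return torch.cosine_similarity(x[..., None], y.T, dim=-2)

    # Toy stand-ins for A [100, 50] and B [400000, 50]
    A = torch.randn(100, 50)
    B = torch.randn(4000, 50)

    sims = cosine_similarity(A, B)                      # [100, 4000]
    top_vals, top_idx = torch.topk(sims, k=10, dim=1)   # 10 best matches per row of A
    ```

    `top_idx[i]` then holds the indices into B of the 10 most similar words to `A[i]`, sorted by descending similarity. Note that the broadcast materializes an intermediate of shape [n, d, m], so for very large B it may be more memory-friendly to L2-normalize both arrays and take a plain matrix product, which yields the same cosine similarities.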