I am trying to find the most similar word embeddings between two arrays. The first array A has dimension [100, 50], where 100 is the number of words and 50 is the embedding dimension. The second array B holds many more words and has dimension [400000, 50]. For each word in A, I want to find the 10 words in B with the highest cosine similarity.
I solved it using two nested for loops, but my method gets slow as the number of samples in A increases, so any trick or advice to speed it up would be helpful. I am using the cosine_similarity function from torch; if there is a faster alternative, that would also be great. I have tried the solution posted here, but it is five years old, so I would like to know if there is a better one by now. Thanks in advance.
A friend shared a much faster solution; this is the code:
import torch
from torch import Tensor

def cosine_similarity(x: Tensor, y: Tensor) -> Tensor:
    # broadcast [100, 50, 1] against [50, 400000], then reduce over
    # dim=-2 (the embedding axis) to get a [100, 400000] matrix
    return torch.cosine_similarity(x[..., None], y.T, dim=-2)
This way the output is a [100, 400000] similarity matrix, and there is no need to loop over either dataset.
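To pull the top 10 matches per word out of that matrix, torch.topk finishes the job; a usage sketch, assuming A is [100, 50] and B is [400000, 50]:

sims = cosine_similarity(A, B)          # [100, 400000]
scores, indices = sims.topk(10, dim=1)  # per row of A: 10 best scores and
                                        # their indices into B

One caveat: as far as I can tell, the broadcasting can materialize a [100, 50, 400000] intermediate tensor, so memory may become the bottleneck as A grows. If so, an equivalent formulation (up to eps handling of zero vectors) is to L2-normalize both matrices and take a plain matrix product, which only allocates the [100, 400000] output; cosine_similarity_matmul is again just an illustrative name:

import torch.nn.functional as F

def cosine_similarity_matmul(x: Tensor, y: Tensor) -> Tensor:
    # rows are unit vectors after normalization, so the dot products
    # in the matrix product are exactly the cosine similarities
    return F.normalize(x, dim=1) @ F.normalize(y, dim=1).T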