I have a model that represents a collection of documents in a multidimensional vector space. For example, given 100K documents, the model represents each one as a 300-dimensional vector, so I end up with a matrix of size [100K, 300]. To retrieve documents by relevance to a given query, I represent the query as a [300, 1] vector and compute the cosine similarity scores with a matrix multiplication:
[100K, 300] * [300, 1] = [100K, 1]
Now, how can I retrieve the top 1000 documents with the highest cosine similarity from this collection? The trivial way would be to sort by cosine similarity and take the first 1000 docs. Is there a function in PyTorch that retrieves the documents this way?
In other words, how can I get the indices of the 1000 highest values in a 1D torch tensor?
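Once you have the similarity scores from the dot product, you can get the indices of the top 1000 as follows. Note that torch.argsort sorts in ascending order by default, so pass descending=True to put the highest scores first (this assumes sims is a 1D tensor, e.g. after squeezing the [100K, 1] result):
top_indices = torch.argsort(sims, descending=True)[:1000]
similar_docs = sims[top_indices]  # similarity scores of those 1000 documents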
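If a full sort is more than you need, torch.topk returns the k largest values and their indices directly. Below is a minimal sketch of that alternative, not necessarily what the setup above requires: the helper name top_k_documents, the random stand-in data, and the reduced collection size are all assumptions made for illustration. It also assumes the document and query vectors are L2-normalized, since only then does the plain dot product equal cosine similarity.

```python
import torch
import torch.nn.functional as F


def top_k_documents(doc_matrix, query, k=1000):
    """Return (scores, indices) of the k highest-scoring documents.

    doc_matrix: [num_docs, dim] tensor of document vectors
    query:      [dim] or [dim, 1] tensor for the query
    Hypothetical helper, named here only for illustration.
    """
    # [num_docs, dim] @ [dim, 1] -> [num_docs, 1], then flatten to 1D
    sims = (doc_matrix @ query.reshape(-1, 1)).squeeze(1)
    # Guard against asking for more items than there are documents
    k = min(k, sims.numel())
    # topk returns values sorted from highest to lowest by default
    scores, indices = torch.topk(sims, k)
    return scores, indices


# Random stand-in data (assumption): 5000 documents instead of 100K
torch.manual_seed(0)
docs = F.normalize(torch.randn(5000, 300), dim=1)   # unit-length rows
query = F.normalize(torch.randn(300), dim=0)        # unit-length query

scores, indices = top_k_documents(docs, query, k=1000)
top_docs = docs[indices]  # the document vectors themselves, best match first
```

Here indices gives the positions of the best-matching documents in the original matrix, and scores holds their similarity values in the same order.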