Tags: search, deep-learning, nlp, information-retrieval

Efficient retrieval of documents represented in the form of multi-dimensional vectors


I've trained a deep neural network for information retrieval. The model represents each document as a 128-dimensional vector, a semantic representation analogous to the word embeddings produced by word2vec. Given a query, the model embeds it in the same 128-dimensional space. From the entire collection, I want to retrieve the top k documents whose vectors are closest to the query vector.

The similarity measure is cosine similarity, defined as follows:

sim(Q, D) = np.dot(Q.T, D)/(np.linalg.norm(Q) * np.linalg.norm(D))

where sim(Q, D) represents the similarity between query Q and document D. In simple words, it is the dot product of the unit vectors of the query and the document.
Now I have roughly 36 million documents, so computing the cosine similarity against every document and then sorting is not feasible for efficient retrieval. I want to efficiently search for the k most similar documents for any query vector in the same 128-dimensional space.
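For reference, the exact brute-force baseline can be sketched as follows: pre-normalize the document matrix once so cosine similarity collapses to a single matrix product, then select the top k with a partial sort instead of a full sort. The document matrix and sizes below are illustrative random data, not the actual corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 128))   # toy stand-in for the N x 128 document matrix
query = rng.standard_normal(128)

# Normalize once so sim(Q, D) = dot(Q_unit, D_unit), matching the formula above.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

sims = docs_unit @ query_unit               # cosine similarity to every document

k = 10
top_k = np.argpartition(-sims, k)[:k]       # O(N) selection of the k best candidates
top_k = top_k[np.argsort(-sims[top_k])]     # order only those k by similarity
```

Even with these tricks this is still a linear scan over all documents per query, which is exactly what becomes too slow at 36 million documents, hence the approximate approach below.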


Solution

  • Use an approximate nearest neighbor (ANN) search library, such as nmslib. These libraries allow you to index dense vectors and then retrieve the indexed vectors closest to a given query. Some example ipython notebooks can be found here.