Tags: information-retrieval, embedding, word-embedding, data-retrieval

What is the most efficient way to store a set of points (embeddings) so that queries for the closest points are computed quickly?


Given a set of embeddings, i.e. a set of [name, vector representation] pairs, how should I store it so that queries for the closest points are computed quickly? For example, given 100 embeddings in 2-D space, if I query the data structure for the 5 closest points to (10, 12), it should return { [a, (9, 11.5)], [b, (12, 14)], ... }.

The trivial approach is to calculate all distances, sort, and return the top-k points. Alternatively, one might think of storing the points in a 2-D grid of m×n blocks covering the range of the embedding space. I don't think this extends to higher dimensions, but I'm willing to be corrected.
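For reference, the trivial approach from the question can be sketched in a few lines of Python (the point names and query here reuse the question's example; a heap avoids a full sort):

```python
import heapq
import math

def top_k_nearest(points, query, k):
    """Brute force: compute every distance, keep the k smallest.

    points: list of (name, vector) pairs; query: a vector of the
    same dimension. O(n log k) via a heap instead of a full sort.
    """
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p[1], query))

points = [("a", (9, 11.5)), ("b", (12, 14)), ("c", (40, 2)), ("d", (10, 10))]
print(top_k_nearest(points, (10, 12), 2))
# → [('a', (9, 11.5)), ('d', (10, 10))]
```

This is exact and fine for 100 points; the approaches in the answer below it matter once the set grows large.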


Solution

  • There are standard approximate nearest-neighbor libraries such as faiss, flann, java-lsh, etc. (which are either LSH- or product-quantization-based), which you may use.

    The quickest solution (which I found useful) is to transform a vector of, say, 100 dimensions into a long variable (64 bits) by using the Johnson–Lindenstrauss transform. You can then use Hamming similarity (i.e. 64 minus the number of bits set in a XOR b) to compute the similarity between bit vectors a and b. You can use the POPCOUNT machine instruction to this effect (which is very fast).

    In effect, if you use POPCOUNT in C, even a complete linear scan over the whole set of binary-transformed vectors (64-bit long variables) will still be very fast.