Search code examples
postgresqlindexingsimilarityn-gram

Find nearest sparse vectors, what kind of index or DB to use?


I'd like to detect similar text documents.

There's a function that takes text as an input and produces vector as an output.

text => vector

The produced vector is sparse. Its dimension is huge (can't say for sure but probably will be about 10_000), but almost all of its elements are nulls. Only about 10-20 of its elements are not null.

vector = [0, 0, 0..., v1, 0...., v2, 0.... ]

So it makes sense to represent this sparse vector as a map instead of array.

vector = { i1: v1, i2: v2 }

What kind of index can I use to efficiently find N vectors closest to the given { i1: v1, i2: v2 } vector? The distance metric could be euclidean or cosine or other.

There are millions of documents. What kind of DB could be used to do such kind of search? PostgreSQL? Redis?


Solution

  • After meditating on Machine Learning stuff here's the answer:

    There's no ready to use DB or Index that can handle high dimensional spaces. There's tools like https://github.com/spotify/annoy but they only can handle dimensions < 1000

    Theoretically it's possible to handle high dimensional spaces using tricks like partitioning, but it's very case specific, no universal solution.

    The better way would be to reduce dimensionality using PCA to <1000 and then it will be possible to use tools like https://github.com/spotify/annoy