Search code examples
algorithmdata-structuresnearest-neighbortradeoffapproximate-nn-searching

Best data structure for high dimensional nearest neighbor search


I'm actually working on high dimensional data (~50.000-100.000 features) and nearest neighbors search must be performed on it. I know that KD-Trees has poor performance as dimensions grows, and also I've read that in general, all space-partitioning data structures tends to perform exhaustive search with high dimensional data.

Additionally, there are two important facts to be considered (ordered by relevance):

  • Precision: The nearest neighbors must be found (not approximations).
  • Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).

So, I need some advice about:

  1. The data structure to perform k-NN.
  2. If it will be better to use an aNN (approximate nearest neighbor) approach, setting it as accurate as possible?.

Solution

  • Can I perform NN search in high dimensional space?

    No. Because of the curse of dimensionality, data structures that perform Nearest Neighbour search good in lower dimensions, fail to perform well in a high dimensional place. As a matter of fact, the query times becomes almost equally to the brute force one, thus it is worthless.

    As a result, in a high dimensional space, one should go for Approximate Nearest Neighbour (ANN) search. To be honest, it's a must.

    Which data structure to perform ANN?

    I would suggest LSH, or a number of RKD trees. In my answer here, I mention some good libraries that perform ANN in C++. However, notice that LSH solved the R-nearest neighbour problem, thus you specify the parameter R, which is actually the radius. Then, LSH will look for NN's inside that R from the query point, thus you can't really request k NN's.

    On the other hand, RKD trees can do that and return you k NN's. I have a project, that builds a forest of RKD trees and performs ANN search in C++, but it targets only high dimensions. It can handle GIST datasets of 10^6 images in 960 dimensions in < 1 sec with about 90% outputs being true nearest neighbours. The name is kd-GeRaF. It will be updated in the next month with a distributed version, but it's already tested and ready to use. It also has a cute logo. :)


    I also feel that you should read my answer, which says that the optimal data structure depends on the data.