Search code examples
pythonspatialknn

Spatial data structures with efficient dynamic updates


I am looking for a python library which does approximate (or exact, no problem with that) nearest neighbor search which has fast dynamic updates. (it seems that the scipy.spatial ones, like KDTree or BallTree do not. It would be even better if this were GPU compatible, but we should not be too greedy.


Solution

  • Have a look at https://github.com/erikbern/ann-benchmarks I don't really know the project, but it appears to written in Python and implements all kind of (A)NN search indices. However, I assume many of them are not written in Python but other languages.

    Unfortunately, I think this project does not benchmark updateability.

    Personally, I know next to nothing about ANN, so I am going to focus on exact NN. The best index for you situation depends on several factors: do you dynamically only add or also remove entries? How many dimension do you have? How big is your dataset?

    Tree types:

    • In my experience, KDtrees are actually not very good at removing entries, but they are good as long as you only add entries (occasional rebalancing may be required).

    If you want to quickly add AND remove entries, I suggest looking at LSH (locality sensitive hashing), R-Trees, and quadtrees.

    • However, as far as I am aware, LSH doesn't work well with higher dimensionality, but I don't really know LSH.
    • R-trees deal much better with higher dimensionality, at least up to a few 10s of 100s of dimensions; I found that the performance depends a lot on the specific implementation.
    • Most quadtrees do not work well with high dimensions, however, the PH-tree (disclaimer: my own project) works quite well with high dimensions, I have tested the Java version with 1000 dimensions and it worked "fine" (however, with that many dimensions, "fine" just means it is not worse than a brute force scan); I suggest using it for up to 50 dimensions or so.

    If you are interested, I compared numerous indexes (Java implementations) here. Figure 32-35 show kNN performance for 1NN and 10NN queries with 1'000'000 entries up to 40 dimensions. CU-P is a dataset with uniformly distributed points, CL-P is a highly clustered dataset. Fig. 42 and 44 show update rates (removing a point and inserting another point somewhere else).

    Implementations of the indexes are available here (Java), the PH-tree is also available in C++.