Tags: algorithm, data-structures, heap, asymptotic-complexity

Most efficient implementation to get the closest k items


In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:

Approach 1

  1. Calculate the distances to the new point from each of N observations
  2. Sort the distances using quicksort and take the top k points

This would take O(N + N log N) = O(N log N) time.
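
To make this concrete, here is a minimal sketch of Approach 1 in Python (the NumPy arrays and the name `knn_by_sorting` are illustrative, not taken from any particular library):

```python
import numpy as np

def knn_by_sorting(observations, new_point, k):
    """Return the k observations closest to new_point, plus their distances."""
    # Step 1, O(N): distance from the new point to every observation
    distances = np.linalg.norm(observations - new_point, axis=1)
    # Step 2, O(N log N): sort all distances, then keep the first k indices
    nearest_idx = np.argsort(distances)[:k]
    return observations[nearest_idx], distances[nearest_idx]

# Example usage on random 2D data
rng = np.random.default_rng(0)
observations = rng.random((1000, 2))
neighbors, dists = knn_by_sorting(observations, np.array([0.5, 0.5]), k=5)
```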

Approach 2

  1. Create a max-heap of size k
  2. Calculate the distance from the new point for the first k points
  3. For each following observation, if its distance is less than the current max in the heap, pop the max and insert the current observation
  4. Re-heapify (O(log k) work per replacement, over N points)
  5. Continue until there are no more observations, at which point the heap holds the k closest distances.

This approach would take O(N + N log k) = O(N log k) operations.
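
A corresponding sketch of Approach 2 using Python's built-in `heapq` (a min-heap, so distances are stored negated to simulate the bounded max-heap; the function name and test data are illustrative):

```python
import heapq
import math

def knn_by_heap(observations, new_point, k):
    """Return (distance, index) pairs for the k closest observations."""
    heap = []  # entries are (-distance, index); the root is the farthest of the k kept so far
    for i, obs in enumerate(observations):
        d = math.dist(obs, new_point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))      # fill the heap with the first k points
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))   # evict the current max, push the closer point: O(log k)
    return sorted((-neg_d, i) for neg_d, i in heap)  # closest first

# Example usage
points = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (0.9, 0.4), (0.5, 0.5), (3.0, 3.0)]
print(knn_by_heap(points, (0.0, 0.0), k=3))
```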

Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!


Solution

  • Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search

    What you describe is linear search (since you need to compute the distance to every point in the dataset). The good thing is that this always works. The bad thing is that it is slow, especially if you query it a lot.

    If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is uniformly distributed (this doesn't mean perfectly, just not in very dense and very tight clusters), then space partitioning works great because it quickly cuts out the points that are too far away anyway (query complexity around O(log N)). It also works for higher dimensionality or when there are some clusters, but performance suffers a bit (still better overall than linear search). A short scikit-learn sketch of this approach follows below.

    Usually space partitioning or locality-sensitive hashing is enough for common datasets.

    The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.
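
Regarding the sklearn part of the question: scikit-learn wraps this kind of pre-built index in `sklearn.neighbors.NearestNeighbors`, which constructs a k-d tree or ball tree once and reuses it for every query. A minimal sketch, assuming scikit-learn and NumPy are installed (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
observations = rng.random((10_000, 3))        # low-dimensional, roughly uniform data
query = np.array([[0.5, 0.5, 0.5]])

# Build the index once (the set-up cost), then reuse it for every query.
index = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(observations)
distances, indices = index.kneighbors(query)  # the tree prunes most points per query
```

With `algorithm='auto'` (the default) the library chooses among k-d tree, ball tree, and brute-force search based on the data, which matches the trade-offs described above.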