In the K-Nearest-Neighbor algorithm, we find the k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Compute the distance from the new point to each of the N observations, sort all N distances, and take the first k.
This would take O(N + N log N) = O(N log N) time.
Approach 2
Compute the N distances while maintaining a max-heap of the k closest points seen so far, so each of the N heap updates costs O(log k).
This approach would take O(N + N log k) = O(N log k) operations.
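To make the two approaches concrete, here is a rough sketch of what I have in mind (toy data, plain Euclidean distance; the function names are just mine):

```python
import heapq
import numpy as np

def k_nearest_sort(X, query, k):
    """Approach 1: compute all N distances, then sort and take the first k."""
    dists = np.linalg.norm(X - query, axis=1)   # O(N) distance computations
    order = np.argsort(dists)                   # O(N log N) sort
    return order[:k]

def k_nearest_heap(X, query, k):
    """Approach 2: keep a max-heap of the k closest points seen so far."""
    dists = np.linalg.norm(X - query, axis=1)   # O(N) distance computations
    heap = []                                   # max-heap via negated distances
    for i, d in enumerate(dists):
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))       # O(log k) per insertion
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))    # O(log k) per replacement
    return np.array([i for _, i in sorted(heap, reverse=True)])

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 2))
query = rng.normal(size=2)
print(k_nearest_sort(X, query, 5))
print(k_nearest_heap(X, query, 5))   # same indices as the sort-based version
```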
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!
Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (you compute the distance to every point in the dataset). The good thing is that it always works. The bad thing is that it is slow, especially if you query it a lot.
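Since you asked about sklearn: its neighbor estimators expose this choice through the algorithm parameter, and algorithm='brute' is exactly this linear search. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))        # N = 10,000 observations in 3D
query = rng.normal(size=(1, 3))         # one new point

# algorithm='brute' scans all N points for every query (linear search).
nn = NearestNeighbors(n_neighbors=5, algorithm='brute')
nn.fit(X)
dist, idx = nn.kneighbors(query)
print(idx[0], dist[0])
```

With the default algorithm='auto', sklearn picks between brute force and a tree-based index based on the data it sees at fit time.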
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is roughly uniformly distributed (it doesn't have to be perfect, just not packed into very dense, very tight clusters), then space partitioning works great because it quickly discards the points that are too far away anyway (roughly O(log N) per query). It also works for higher dimensionality or when there are some clusters, but the performance suffers a bit (still better overall than linear search).
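Space partitioning is what sklearn's tree-based options do; for example, a k-d tree is built once and each query then only descends into a few branches. A sketch, again with toy data (leaf_size=40 is just an arbitrary choice here):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3))       # low-dimensional data: k-d trees shine here

# One-off set-up cost: build the tree.
tree = KDTree(X, leaf_size=40)

# Each query now prunes whole regions of space instead of scanning all N points.
query = rng.normal(size=(1, 3))
dist, idx = tree.query(query, k=5)
print(idx[0], dist[0])
```

For higher-dimensional or clustered data, sklearn's BallTree (algorithm='ball_tree') is the usual fallback.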
Usually space partitioning or locality-sensitive hashing is enough for common datasets.
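As far as I know sklearn no longer ships an LSH estimator, but the idea is easy to sketch by hand: hash each point with a few random hyperplanes and only compute exact distances against the points that land in the same bucket as the query. This is a toy, single-table version; real implementations use several hash tables so near neighbors are less likely to be missed:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))       # higher-dimensional data
query = rng.normal(size=50)

# Random-hyperplane LSH: the hash of a point is the sign pattern of its
# projections onto a handful of random directions.
n_bits = 8
planes = rng.normal(size=(X.shape[1], n_bits))

def lsh_key(v):
    return tuple((v @ planes > 0).astype(int))

# Set-up: bucket every point by its hash (done once).
buckets = defaultdict(list)
for i, x in enumerate(X):
    buckets[lsh_key(x)].append(i)

# Query: only score the candidates that share the query's bucket.
candidates = buckets[lsh_key(query)]
d = np.linalg.norm(X[candidates] - query, axis=1)
print(np.array(candidates)[np.argsort(d)[:5]])
```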
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.
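To make that trade-off concrete, here is a rough way to measure it yourself with sklearn (the numbers will depend heavily on your machine and data, so treat it as a sketch):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 3))
queries = rng.normal(size=(5_000, 3))

for algorithm in ("brute", "kd_tree"):
    t0 = time.perf_counter()
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)  # set-up cost
    t1 = time.perf_counter()
    nn.kneighbors(queries)                                            # query cost
    t2 = time.perf_counter()
    print(f"{algorithm:8s}  fit: {t1 - t0:.3f}s  5000 queries: {t2 - t1:.3f}s")
```

For brute force the fit is essentially free but every query pays O(N); for the tree, you pay the build cost once and the queries become cheap, which is exactly why it wins when there are many of them.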