Sorting the smallest K elements of a vector to implement a brute-force K-nearest neighbor algorithm on GPU

I have implemented a K-nearest neighbor on the GPU using both pure CUDA and Thrust library function calls.

Euclidean distances are computed with a pure CUDA kernel. Then, Thrust sorting facilities (radix sort) are used to sort the distances in increasing order. Finally, the K first elements (i.e. the K nearest neighbors) are retrieved from the sorted vectors.

My implementation works well. However, sorting the entire euclidean distances matrix (sets can contain more than 250000 train samples) just to retrieve the K-nn seems non-optimal.

Therefore, I'm searching for a GPU algorithm implementation which allows to stop the sorting computations once the K smallest elements are found, or which performs an efficient K out of N sorting. It would indeed be faster for small K than sorting the entire matrix.

If such an implementation is not available, I would also be interested by advices to implement it efficiently in pure CUDA or Thrust. I was thinking to use a few threads per test samples to look for K nearest, each thread running to a part of the euclidean distances. I would maintain a buffer of size K in shared memory. I would run through the distances and insert the Knn in the shared memory vector. However, it would require some warp level synchronization and thread divergence.

Thank you for your help.

Solution

You are seeking an approach for the K-nearest neighbor problem consisting of two steps:

Finding the Euclidean distances between the elements;
Finding the first K elements providing the K smallest distances.

It seems that such an approach is already existing and has been implemented in

K.Kato and T.Hosino, "Solving k-Nearest Neighbor Problem on Multiple Graphics Processors"

and presented at the 2009 GTC Conference as

K.Kato and T.Hosino, "You Might Also Like: A Multi-GPU Recommendation System".

The approach solves the above two steps by

using the classical N-body approach developed in L.Nyland, M.Harris, J.Prins, "Fast N-body simulation with CUDA," In: GPU Gems III. NVIDIA (2007) 677–695 to calculate the Euclidean distances;
using the partial sorting technique, based on a parallel heapsort idea.

Again, as mentioned in my comment above, a better approach avoiding your "brute-force" one would be to use KD-trees.