I'm evaluating CUDA and currently using Thrust library to sort numbers.
I'd like to create my own comparer for thrust::sort, but it slows down drammatically! I created my own less implemetation by just copying code from functional.h. However it seems to be compiled in some other way and works very slowly.
I'm using Visual Studio 2010. What should I do to get the same performance as at option 1?
Complete code:
#include <stdio.h>
#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
int myRand()
{
static int counter = 0;
if ( counter++ % 10000 == 0 )
srand(time(NULL)+counter);
return (rand()<<16) | rand();
}
template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
__host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
return lhs < rhs;
}
};
int main()
{
thrust::host_vector<int> h_vec(10 * 1000 * 1000);
thrust::generate(h_vec.begin(), h_vec.end(), myRand);
thrust::device_vector<int> d_vec = h_vec;
int clc = clock();
thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
printf("%dms\n", (clock()-clc) * 1000 / CLOCKS_PER_SEC);
return 0;
}
The reason you are observing a difference in performance is because Thrust is implementing the sort with different algorithms depending on the arguments provided to thrust::sort
.
In case 1., Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data to sort is a built-in numeric type (int
), and the comparison function is the built-in less than operation -- Thrust recognizes that thrust::less<int>
will produce the equivalent result as x < y
.
In case 2., Thrust knows nothing about your user-provided less<int>
, and has to use a more conservative algorithm based on a comparison sort which has different asymptotic complexity, even though in truth your less<int>
is equivalent to thrust::less<int>
.
In general, user-defined comparison operators can't be used with more restrictive, faster sorts which manipulate the binary representation of data such as radix sort. In these cases, Thrust falls back on a more general, but slower sort.