
Slow sorting using Thrust, CUDA


I am a newbie to CUDA. I simply tried to sort an array using Thrust.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>

clock_t start_time = clock();

// Fill a small host vector with random values and copy it to the device.
thrust::host_vector<int> h_vec(10);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
thrust::device_vector<int> d_vec = h_vec;

// Sort on the device (the commented-out line sorts on the host instead).
thrust::sort(d_vec.begin(), d_vec.end());
//thrust::sort(h_vec.begin(), h_vec.end());

clock_t stop_time = clock();
printf("%f\n", (double)(stop_time - start_time) / CLOCKS_PER_SEC);

The time taken to sort d_vec is 7.4 s, while the time taken to sort h_vec is 0.4 s.

I assumed this is a parallel computation in device memory, so shouldn't it be faster?


Solution

  • Probably the main problem is context creation time: the first CUDA call initializes the CUDA context, which takes some time; see here. Therefore you should start measuring time only after the first CUDA call.

    In general, you can only expect a speed-up from GPU code over CPU code if the degree of parallelism is high enough. A vector size of 10, as in the example code, is definitely too small to achieve any speed-up. With a vector size >> 10000 you can expect to fully utilize a modern GPU.

    You should also consider measuring only the sorting time, without the copy d_vec = h_vec, since you will often work with the device vector in a subsequent step; the copy can then be treated as a one-time setup cost. (However, if sorting is the only operation on the device, it is of course reasonable to include the memcpy in the measurement.) A combined sketch of these points is shown below.
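
A minimal sketch that combines these points might look as follows. The warm-up call cudaFree(0), the vector size of 1 << 24, and the explicit cudaDeviceSynchronize() before stopping the clock are illustrative assumptions, not part of the original answer.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    // Warm-up: force CUDA context creation before any timing starts (assumed idiom).
    cudaFree(0);

    // Use a vector large enough to actually exploit the GPU's parallelism (assumed size).
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // Host-to-device copy done before timing: treated here as a one-time setup cost.
    thrust::device_vector<int> d_vec = h_vec;

    clock_t start_time = clock();
    thrust::sort(d_vec.begin(), d_vec.end());
    cudaDeviceSynchronize();  // make sure the device sort has finished before reading the clock
    clock_t stop_time = clock();

    printf("%f\n", (double)(stop_time - start_time) / CLOCKS_PER_SEC);
    return 0;
}

With the context already initialized and a sufficiently large input, the measured interval now covers only the device sort itself.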