When I call thrust::inclusive_scan several times, why is the first call much slower than the subsequent ones?
Here is the code:
float ttime;
for (int i = 0; i < 5; i++) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // wrap the raw device pointers so Thrust dispatches to the GPU
    thrust::device_ptr<int> din(device_input);
    thrust::device_ptr<int> dout(device_output);
    thrust::inclusive_scan(din, din + N, dout);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ttime, start, stop);
    printf("cost %fms\n", ttime);
    // release the events created in this iteration
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
I ran it on a GTX 1080, and the results are:
cost 39.180702ms
cost 0.200704ms
cost 0.201728ms
cost 0.202752ms
cost 0.197632ms
Can anybody help explain this?
Thrust is built using the CUDA runtime API, and that API uses lazy context initialisation.
The exact initialisation sequence is not documented, and there is empirical evidence that it has changed over time. However, it appears that context setup is done piecemeal, with each part set up when it is first needed rather than all at once.
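It is likely that the slow first call is related to loading and initialisation of the module containing the thrust code within your program. You might be able to verify this by profiling your code and comparing the kernel execution time reported by the profiler with the wall clock time for that first call. If the kernels themselves take about as long in the first call as in the later ones, the extra time is being spent on setup rather than on the scan.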
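If the aim is simply to keep that one-off cost out of the measurement, one common approach is to make an untimed warm-up call before the timed loop. The following is only a minimal sketch of that idea, not a definitive fix: the helper name timed_scan_ms, the problem size N and the use of thrust::device_vector in place of your raw device pointers are all my own assumptions, since the question does not show how the buffers are allocated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical helper (not from the question): times one inclusive_scan
// with CUDA events and returns the elapsed time in milliseconds.
static float timed_scan_ms(const thrust::device_vector<int>& in,
                           thrust::device_vector<int>& out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int N = 1 << 20;  // assumed problem size, not taken from the question

    // Input of all ones, so the last element of the scan must equal N.
    thrust::device_vector<int> in(N, 1), out(N);

    // Untimed warm-up call: whatever one-off setup the first scan triggers
    // happens here instead of inside the timed loop.
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());
    cudaDeviceSynchronize();

    for (int i = 0; i < 5; ++i)
        printf("cost %fms\n", timed_scan_ms(in, out));

    // Sanity check on the result of the scan.
    int last = out[N - 1];
    return last == N ? 0 : 1;
}
```

With the warm-up in place I would expect all five printed timings to be of similar magnitude, but the actual numbers will depend on your GPU, driver and CUDA version.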