Why is the first call to thrust::inclusive_scan much slower than subsequent calls?

When I call thrust::inclusive_scan several times, why is the first time much slower than subsequent calls?

Here is the code

float ttime;
for(int i=0;i<5;i++){
     cudaEvent_t start,stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     cudaEventRecord(start,0);

     // wrap the raw device pointers so Thrust dispatches to the GPU
     thrust::device_ptr<int > din(device_input);
     thrust::device_ptr<int > dout(device_output);
     thrust::inclusive_scan(din,din+N,dout);

     cudaEventRecord(stop,0);
     cudaEventSynchronize(stop);
     cudaEventElapsedTime(&ttime,start,stop);
     printf("cost %fms\n",ttime);

     // release the events created in this iteration
     cudaEventDestroy(start);
     cudaEventDestroy(stop);
}

I ran it on a GTX 1080, and the results are:

cost 39.180702ms
cost 0.200704ms
cost 0.201728ms
cost 0.202752ms
cost 0.197632ms

Can anybody help explain this?

Solution

Thrust is built using the CUDA runtime API, and that API uses lazy context initialisation.

The exact initialisation sequence is not documented, and there is empirical evidence that it has changed over time. However, it appears that context setup is done on an ad hoc basis, so parts of it can be deferred until they are first needed.
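As a rough illustration of the lazy behaviour, the sketch below times two consecutive runtime API calls with a host clock. It assumes the common idiom of calling cudaFree(0) as a harmless way to trigger context creation; the helper name time_first_api_call_ms is hypothetical, and the numbers you see will depend on your driver, toolkit, and GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA or Thrust): returns the host
// wall-clock time in milliseconds taken by one cudaFree(0) call, or a
// negative value if the call fails.
double time_first_api_call_ms() {
    auto t0 = std::chrono::steady_clock::now();
    // cudaFree(0) frees nothing; it is used here only because a runtime
    // API call like this forces any pending context setup to happen.
    cudaError_t err = cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // The first call is expected to absorb the context creation cost.
    double first = time_first_api_call_ms();
    // The second call runs against an already initialised context.
    double second = time_first_api_call_ms();
    std::printf("first call:  %f ms\n", first);
    std::printf("second call: %f ms\n", second);
    return 0;
}
```

Note that in the question's code the calls to cudaEventCreate come before the timed region, so some context setup has probably already happened by the time the first scan is measured.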

It is likely that the slow first call is related to loading and initialising the module containing the Thrust code within your program. You might be able to verify this by profiling your code and comparing the profiled execution time of that first call with its wall clock time.
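If the one-time cost is what you want to exclude from your measurements, one option is an untimed (or separately reported) warm-up call before the loop. The following is a minimal sketch under stated assumptions, not a definitive benchmark: the helper time_scan_ms is hypothetical, the input size N is an assumed value because the question does not state it, and thrust::device_vector is used in place of the raw device_input and device_output pointers to keep the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical helper: times a single inclusive_scan with CUDA events
// and returns the elapsed time in milliseconds.
float time_scan_ms(const thrust::device_vector<int>& in,
                   thrust::device_vector<int>& out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int N = 1 << 20;  // assumed size; not given in the question
    thrust::device_vector<int> in(N, 1);  // N ones
    thrust::device_vector<int> out(N);

    // Warm-up call: expected to absorb the one-time setup cost.
    float warmup = time_scan_ms(in, out);
    std::printf("warm-up %fms\n", warmup);

    // Steady-state calls: these are the numbers to compare.
    for (int i = 0; i < 5; ++i) {
        std::printf("cost %fms\n", time_scan_ms(in, out));
    }

    // Sanity check: the inclusive scan of N ones ends with N.
    int last = out.back();
    std::printf("last element %d (expected %d)\n", last, N);
    return 0;
}
```

Reporting the warm-up time separately, rather than discarding it, keeps the one-time cost visible while leaving the steady-state figures uncontaminated.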