I use Thrust (CUDA C++) on a GeForce GTX 460 SE GPU, which has asyncEngineCount = 1. As I understand it, that means I can overlap a data transfer in one direction (to or from the GPU) with the execution of a single kernel. But when I use:
cudaStream_t Stream1, Stream2;
cudaStreamCreate(&Stream1);
cudaStreamCreate(&Stream2);
cudaMemcpyAsync(thrust::raw_pointer_cast(d_vec_src.data()), host_ptr1, test_size, cudaMemcpyHostToDevice, Stream1);
cudaMemcpyAsync(host_ptr2, thrust::raw_pointer_cast(d_vec_dst.data()), test_size, cudaMemcpyDeviceToHost, Stream2);
thrust::sort(d_vec_dst.begin(), d_vec_dst.end());
cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
and Thrust algorithms, everything executes sequentially, as I can see in the NVIDIA Visual Profiler: transfer from the GPU, transfer to the GPU, then kernel execution. Maybe this is because Thrust algorithms execute in the default stream (stream 0), which cannot overlap with anything? And how can I solve this problem?
Thrust doesn't presently have a mechanism for controlling the execution stream of its algorithms, so you can't do what you are asking with the current code base. There have been reports of users modifying the Thrust code base to accept a stream (for example, in this Google Groups thread), but that may or may not be viable depending on the complexity and structure of the algorithm you use. Some algorithms also perform internal data transfers, and you would need to be very careful not to break things when moving from serial to asynchronous execution.
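As a point of comparison, the overlap itself does work on your hardware if the kernel is one you launch yourself in a non-default stream; it is only the Thrust call that is pinned to stream 0. A minimal sketch of copy/kernel overlap, assuming a made-up placeholder kernel `scale` (any real kernel works the same way). Note also that cudaMemcpyAsync only truly overlaps when the host buffers are pinned, i.e. allocated with cudaHostAlloc or cudaMallocHost, not plain malloc:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel for illustration; substitute your real work here.
__global__ void scale(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory: required for cudaMemcpyAsync to be truly asynchronous.
    float *h_src, *h_dst;
    cudaHostAlloc(&h_src, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_dst, bytes, cudaHostAllocDefault);

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Upload into d_a on s1 while the kernel works on d_b in s2.
    // With asyncEngineCount = 1, one copy can overlap one kernel,
    // but the two copies themselves cannot overlap each other.
    cudaMemcpyAsync(d_a, h_src, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n, 2.0f);
    // Same stream as the kernel, so this copy waits for it to finish.
    cudaMemcpyAsync(h_dst, d_b, bytes, cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    return 0;
}
```

In the profiler you should see the H2D copy on s1 overlapping the kernel on s2, which is the behavior your Thrust version cannot produce as long as thrust::sort is issued into stream 0.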