I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU holds an equally sized partition of the data. Using OpenMP, I call the function once per device; on my current system I am working with 4 GPUs.
#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i)
{
    const int device = devices[i];
    cudaSetDevice(device); // bind this OpenMP thread to the GPU that owns the partition
    thrust::copy(thrust::device, // execution policy
                 device_buffers->At(device)->begin(), // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size)); // thrust::host_vector
}
After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed. When copying data from device to host, the call mixes a device iterator (source) with a host iterator (destination).

1. Which execution policy is picked here by default, thrust::host or thrust::device?
After doing some benchmarks, I observe that passing thrust::device explicitly improves performance compared to not passing an explicit policy.
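For reference, the two variants compared in the benchmark look roughly like this (a minimal sketch; d_vec and h_vec stand in for one device partition and the matching host range, the names are illustrative):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>

void copy_variants(thrust::device_vector<float>& d_vec,
                   thrust::host_vector<float>& h_vec)
{
    // Variant A: no explicit policy; Thrust dispatches based on the iterator
    // tags (device iterator as source, host iterator as destination).
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    // Variant B: the policy is forced to thrust::device, which is the variant
    // that turned out to be faster in the benchmarks mentioned above.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
}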
2. What could be the reason for the performance gain? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?
However, nvprof no longer displays the [CUDA memcpy DtoH] category and instead shows void thrust::cuda_cub::core [...] __parallel_for::ParallelForAgent [...], which even shows an increase in Time (s). This does not make sense because, as I said, I observed a consistent performance improvement (smaller total execution time) when using thrust::device.
3. Is this just an nvprof + Thrust-specific behaviour that causes profiling numbers not to correlate with actual execution time? I observed something similar for cudaFree: it seems that cudaFree returns control to the host code fairly quickly, which results in a small execution time, while nvprof shows much higher numbers because the actual deallocation probably happens in a lazy fashion.
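As a side note on measuring this: a minimal sketch of the kind of host-side wall-clock measurement that is robust against such lazily returning calls (assumes std::chrono and a cudaDeviceSynchronize() per device; names are illustrative, this is not the exact benchmark code):

#include <chrono>
#include <vector>
#include <cuda_runtime.h>

double time_copy_loop(const std::vector<int>& devices)
{
    // Drain any pending work so the timer only sees the copy loop itself.
    for (int device : devices) { cudaSetDevice(device); cudaDeviceSynchronize(); }

    const auto start = std::chrono::steady_clock::now();

    // ... the OpenMP copy loop from above goes here ...

    // Synchronize again so asynchronously returning calls are fully counted.
    for (int device : devices) { cudaSetDevice(device); cudaDeviceSynchronize(); }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}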
The Thrust documentation on thrust::device states the following:

Raw pointers allocated by host APIs should not be mixed with a thrust::device algorithm invocation when the device backend is CUDA
To my understanding, this means that a host-device copy with the thrust::device execution policy is invalid in the first place, unless the host memory is pinned.

I assume that your host allocation is not pinned, BUT: one possibility is that on POWER9 with NVLink you may be lucky and any host-allocated memory is addressable from within the GPU. Thanks to that, a host-device copy with thrust::device works, although it should not.
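One way to check how the runtime classifies an ordinary host allocation is to query it with cudaPointerGetAttributes (a hedged sketch; what it reports for malloc-ed memory on a POWER9/NVLink machine is exactly the open question here):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    void* p = std::malloc(1 << 20); // ordinary, non-pinned host allocation

    cudaPointerAttributes attr{};
    const cudaError_t err = cudaPointerGetAttributes(&attr, p);

    // On a typical x86 system this reports cudaMemoryTypeUnregistered
    // (or an error on older toolkits); the interesting part is what a
    // POWER9 + NVLink system reports for the same pointer.
    std::printf("status: %s, memory type: %d\n",
                cudaGetErrorString(err), static_cast<int>(attr.type));

    std::free(p);
    return 0;
}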
On a regular system, host memory is addressable from within a GPU only if it is allocated with cudaMallocHost (pinned). So the question is whether your POWER system has automagically upgraded all allocations to be pinned. Is the observed performance bonus due to the implicitly pinned memory, or would you get an additional speedup if allocations were also done with cudaMallocHost explicitly?
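A minimal sketch of that experiment, assuming an explicitly pinned staging buffer obtained from cudaMallocHost (buffer names and the float element type are illustrative):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

void copy_to_pinned_host(const thrust::device_vector<float>& d_vec)
{
    float* pinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&pinned),
                   d_vec.size() * sizeof(float)); // page-locked host memory

    // With explicitly pinned host memory, mixing the raw host pointer with the
    // thrust::device policy is fine; comparing this timing against the plain
    // host_vector version isolates the effect of pinning.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), pinned);

    cudaFreeHost(pinned);
}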
Another piece of evidence from Thrust's design is that the thrust::device policy has par.on(stream) support, while thrust::host does not. This aligns with the fact that asynchronous host-device copies are only possible with pinned memory.