I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU holds an equally sized partition of the data. Using OpenMP, I call the function once per device; on my current system I am working with 4 GPUs.
#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i)
{
    const int device = devices[i];
    cudaSetDevice(device); // bind this OpenMP thread to the GPU that owns the partition
    thrust::copy(thrust::device, // execution policy
                 device_buffers->At(device)->begin(), // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size)); // thrust::host_vector
}
After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed. When copying data from device to host, the call mixes a device iterator (source) with a host iterator (destination).

1. Which execution policy is picked here by default, thrust::host or thrust::device?
After doing some benchmarks, I observe that passing thrust::device explicitly improves performance compared to not passing an explicit policy.
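For reference, the two variants compared in the benchmark look roughly like this (a minimal sketch; d_vec and h_vec stand in for one device partition and the matching host range, the names are illustrative):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>

void copy_variants(thrust::device_vector<float>& d_vec,
                   thrust::host_vector<float>& h_vec)
{
    // Variant A: no explicit policy; Thrust dispatches based on the iterator
    // tags (device iterator as source, host iterator as destination).
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    // Variant B: the policy is forced to thrust::device, which is the variant
    // that turned out to be faster in the benchmarks mentioned above.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
}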
2. What could be the reason for the performance gain? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?
However, nvprof no longer displays the [CUDA memcpy DtoH] category and instead shows void thrust::cuda_cub::core [...] __parallel_for::ParallelForAgent [...], which even shows an increase in Time (s). This does not make sense because, as I said, I observed a consistent performance improvement (smaller total execution time) when using thrust::device.
3. Is this just an nvprof + Thrust-specific behaviour that causes profiling numbers not to correlate with actual execution time? I observed something similar for cudaFree: it seems that cudaFree returns control to the host code fairly quickly, which results in a small execution time, while nvprof shows much higher numbers because the actual deallocation probably happens in a lazy fashion.
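As a side note on measuring this: a minimal sketch of the kind of host-side wall-clock measurement that is robust against such lazily returning calls (assumes std::chrono and a cudaDeviceSynchronize() per device; names are illustrative, this is not the exact benchmark code):

#include <chrono>
#include <vector>
#include <cuda_runtime.h>

double time_copy_loop(const std::vector<int>& devices)
{
    // Drain any pending work so the timer only sees the copy loop itself.
    for (int device : devices) { cudaSetDevice(device); cudaDeviceSynchronize(); }

    const auto start = std::chrono::steady_clock::now();

    // ... the OpenMP copy loop from above goes here ...

    // Synchronize again so asynchronously returning calls are fully counted.
    for (int device : devices) { cudaSetDevice(device); cudaDeviceSynchronize(); }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}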
The Thrust documentation on thrust::device states the following:

Raw pointers allocated by host APIs should not be mixed with a thrust::device algorithm invocation when the device backend is CUDA
To my understanding, this means that a host-device copy with the thrust::device execution policy is invalid in the first place, unless the host memory is pinned.

I assume that your host allocation is not pinned, BUT: one possibility is that on POWER9 with NVLink you may be lucky and any host-allocated memory is addressable from within the GPU. Thanks to that, a host-device copy with thrust::device works, although it should not.
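One way to check how the runtime classifies an ordinary host allocation is to query it with cudaPointerGetAttributes (a hedged sketch; what it reports for malloc-ed memory on a POWER9/NVLink machine is exactly the open question here):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    void* p = std::malloc(1 << 20); // ordinary, non-pinned host allocation

    cudaPointerAttributes attr{};
    const cudaError_t err = cudaPointerGetAttributes(&attr, p);

    // On a typical x86 system this reports cudaMemoryTypeUnregistered
    // (or an error on older toolkits); the interesting part is what a
    // POWER9 + NVLink system reports for the same pointer.
    std::printf("status: %s, memory type: %d\n",
                cudaGetErrorString(err), static_cast<int>(attr.type));

    std::free(p);
    return 0;
}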
On a regular system, host memory is addressable from within a GPU only if it is allocated with cudaMallocHost (pinned). So the question is whether your POWER system has automagically upgraded all allocations to be pinned. Is the observed performance bonus due to the implicitly pinned memory, or would you get an additional speedup if allocations were also done with cudaMallocHost explicitly?
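A minimal sketch of that experiment, assuming an explicitly pinned staging buffer obtained from cudaMallocHost (buffer names and the float element type are illustrative):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

void copy_to_pinned_host(const thrust::device_vector<float>& d_vec)
{
    float* pinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&pinned),
                   d_vec.size() * sizeof(float)); // page-locked host memory

    // With explicitly pinned host memory, mixing the raw host pointer with the
    // thrust::device policy is fine; comparing this timing against the plain
    // host_vector version isolates the effect of pinning.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), pinned);

    cudaFreeHost(pinned);
}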
Another piece of evidence from Thrust's design is that the thrust::device policy has par.on(stream) support, while thrust::host does not. This aligns with the fact that asynchronous host-device copies are only possible with pinned memory.