Segmentation fault for thrust::host execution policy

I try to copy data from host to device and back, but not with the CUDA API but the thrust library. I allocated memory in a thrust::host_vector, and try to copy it to a thrust::device_vector. However, when using thrust::copy with the thrust::host execution policy for any data transfer from host <-> device, the program crashes with a segmentation fault. Cuda-memcheck provides the following error message:

Error: process didn't terminate successfully
The application may have hit an error when dereferencing Unified Memory from the host.

The documentation on what the thrust::host and thrust::device execution policies actually do and what constraints are to be taken into account when using them is pretty scarce.

What are potential causes for thrust::copy not to work with the thrust::host execution policy? Note that not specifying the parameter explicitly works fine. The machine that I am working on is a POWER9 machine.

Here is a small reproducible example: Build with nvcc -O3 -std=c++11 -Xcompiler -fopenmp test.cu -o test

#include <vector>
#include <omp.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define NUM_GPUS 4

int main(int argc, char *argv[]) {

    size_t num_elements = 10000;
    size_t block_size = num_elements/4;

    thrust::host_vector<int> hvec(num_elements);

    std::vector<thrust::device_vector<int>*> dvecs(NUM_GPUS);
    
    #pragma omp parallel for
    for (size_t i = 0; i < NUM_GPUS; ++i)
    {
        cudaSetDevice(i);

        dvecs[i] = new thrust::device_vector<int>(block_size);

        thrust::copy(   thrust::host,
                        hvec.begin() + (block_size * i),
                        hvec.begin() + (block_size * (i + 1)),
                        dvecs[i]->begin());
    }

    return 0;
}


nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 10.2, V10.2.89

gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)

Solution

You shouldn't use an execution policy of either thrust::host or thrust::device when using thrust::copy to copy data between host and device.

The reason for this is fairly evident by reading the documentation for thrust::device (for example).

Instead of relying on implicit algorithm dispatch through iterator system tags, users may directly target algorithm dispatch at Thrust's device system by providing thrust::device as an algorithm parameter.

Similar wording and intent is provided for thrust::host

Of course, this is not what you want when copying data between host and device. You are depending on inspection of the iterators to determine the direction of transfer, among other things. Passing thrust::host means that thrust can interpret both addresses (ultimately the iterators are reduced to addresses used by a copy operation) as if they were valid host addresses, and therefore do a host->host copy. If one of those addresses is a device address, that will lead to a seg fault.

Passing thrust::device means that thrust can interpret both addresses as if they were valid device addresses, and therefore do a device->device copy. If one of those addresses is a host address, that will lead to either an invalid parameter error or an illegal address error (if the copy is implemented via kernel. In my test I happen to see the illegal address error).

The above is certainly the behavior I would expect to see on a non-Power9 system. If you believe you should see something different on a Power9 system, you may wish to file a thrust issue. However, passing an execution policy for this algorithm seems nonsensical to me, regardless of the platform.