cuda misaligned address on thrust device_vector resize

This is odd... thrust::device_vector.resize throws with cudaErrorMisalignedAddress, but only if I first call curandGenerateNormal with a start address not aligned to 8 bytes:

#include <cuda_runtime.h>
#include <curand.h>
#include <thrust/device_vector.h>
#include <assert.h>

int main()
{
    thrust::device_vector<float> a(16), b(0);

    curandGenerator_t _prng;
    curandStatus_t curandStat = curandCreateGenerator(&_prng, CURAND_RNG_PSEUDO_DEFAULT);
    assert(curandStat == CURAND_STATUS_SUCCESS);

    bool breakCUDA = true;

    if (breakCUDA) {
        // this curand call (not 8-byte aligned) somehow breaks subsequent resize
        float *start_p1 = a.data().get() + 1;
        curandStat = curandGenerateNormal(_prng, start_p1, 2, 0.0f, 1.0f);
        assert(curandStat == CURAND_STATUS_SUCCESS);
    }
    else {
         // this one, using an 8-byte aligned pointer works fine
         float *start = a.data().get();
         curandStat = curandGenerateNormal(_prng, start, 2, 0.0f, 1.0f);
         assert(curandStat == CURAND_STATUS_SUCCESS);
    }

    // note: either call above returns CURAND_STATUS_SUCCESS

    // but this throws thrust::system_error with error cudaErrorMisalignedAddress
    // if the unaligned pointer was used before
    b.resize(16);
}

In my real code I need to use different generation parameters (the 0.0f, 1.0f) on different segments of the first vector, and the segment boundaries are not necessarily memory aligned.

The doc for curandGenerateNormal says the length has to be even (as it is in both cases) but doesn't mention anything about alignment.

I have a workaround now: I check if the pointer I'm about to pass to curandGenerateNormal is aligned to 8 bytes and if not I generate to some temporary memory and copy it. But I'd appreciate it if anyone has any more insight into what is going on so I can make sure I do the right thing in the future. Are there any other thrust or curand methods where I have to be careful about alignment?

This is CUDA 6.5 on Windows.

Thanks.

Solution

I think the fundamental issue is that curandGenerateNormal is expecting to write a quantity that is aligned to twice the fundamental data type (float, in this case). Therefore, the pointer you pass to curandGenerateNormal, when using a PRNG such as the default XORWOW generator, should be aligned to twice the data type (i.e. 8-byte aligned in this case, or 16-byte aligned in the case of curandGenerateNormalDouble, for example). I don't believe the issue has anything to do with thrust.

Although the issue is not well documented that I can see, a hint of it may be found in the documentation you linked:

Normally distributed results are generated from pseudorandom generators with a Box-Muller transform, and so require n to be even.

Let's consider a slightly different test case, to prove that thrust is not at issue, and to take a look at what is going on under the hood:

$ cat t625.cu
#include <curand.h>
#include <iostream>
#define DSIZE 4

int main(){

  curandGenerator_t _prng;
  curandStatus_t curandStat = curandCreateGenerator(&_prng, CURAND_RNG_PSEUDO_DEFAULT);
  float *h_a, *d_a;
  h_a = (float *)malloc(DSIZE*sizeof(float));
  cudaMalloc(&d_a, DSIZE*sizeof(float));
  cudaMemset(d_a, 0, DSIZE*sizeof(float));
  float *start_p1 = d_a+ 1;
  curandStat = curandGenerateNormal(_prng, start_p1, 2, 0.0f, 1.0f);
  cudaMemcpy(h_a, d_a, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++)
    std::cout << h_a[i] << std::endl;
  return 0;
}
[user2@dc20 misc]$ vi t625.cu
[user2@dc20 misc]$ nvcc -arch=sm_20 -o t625 t625.cu -lcurand
[user2@dc20 misc]$ cuda-memcheck ./t625
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x000003e8 in void gen_sequenced<curandStateXORWOW, float2, normal_args_st, __operator_&__(float2 curand_normal_scaled2<curandStateXORWOW>(curandStateXORWOW*, normal_args_st))>(curandStateXORWOW*, float2*, unsigned long, unsigned long, normal_args_st)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x13047c0004 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x14ad95]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0x726d8]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0x9b923]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 [0xfc95]
=========     Host Frame:/usr/local/cuda/lib64/libcurand.so.6.5 (curandGenerateNormal + 0x1ee7) [0x3b987]
=========     Host Frame:./t625 [0x27a2]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./t625 [0x2639]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x2ef613]
=========     Host Frame:./t625 [0x37fdf]
=========     Host Frame:./t625 [0x27c2]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./t625 [0x2639]
=========
1.14162e-37
0
7.40782e-38
0
========= ERROR SUMMARY: 2 errors
$

(I am working in linux, but I wouldn't expect any difference between windows and linux here.)

The above program generates basically the same error. Therefore we can conclude that thrust is not necessary to see the problem. Taking a closer look at the cuda-memcheck output, we see:

========= Invalid __global__ write of size 8
=========     at 0x000003e8 in void gen_sequenced<curandStateXORWOW, float2, normal_args_st, __operator_&__(float2 curand_normal_scaled2<curandStateXORWOW>(curandStateXORWOW*, normal_args_st))>(curandStateXORWOW*, float2*, unsigned long, unsigned long, normal_args_st)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x13047c0004 is misaligned

The gen_sequenced is a kernel call that is contained within the host API function curandGenerateNormal. It is attempting to write an 8-byte quantity, which must (by CUDA requirement) be on an 8-byte aligned boundary. As you've already indicated, the pointer being passed is 4-byte aligned but not 8-byte aligned, in the failing case. Furthermore, we see that this kernel under the hood is using a float2 quantity. This is undoubtedly an optimization done since it's known that the quantity n must be even. A float2 quantity can only be accessed on an 8-byte boundary.

The conclusion therefore, although it doesn't seem to be explicitly documented, seems to be that for the cases covered by this statement:

Normally distributed results are generated from pseudorandom generators with a Box-Muller transform, and so require n to be even.

the pointer passed must be aligned to twice the fundamental datatype. I will file a notice with NVIDIA to request that the documentation be updated.

Regarding error reporting, the actual error that occurs (misaligned pointer) as detected by the CUDA kernel will be detected asynchronously, and will not be reported at the time of kernel launch (the gen_sequenced kernel, in this case). It will be reported subsequently at some future point, when the CUDA error status is checked. This may explain why the curand function itself returns a positive result. Thrust has runtime error handling built in, so a previously occurring CUDA error of this type will be "caught" by Thrust and reported, even though (as in this case) it may have nothing to do with Thrust, per se.