Search code examples
randomcudathrustcurand

cuRAND performs much worse than thrust when generating random numbers inside CUDA kernels


I am trying to generate "random" numbers from a uniform distribution inside a CUDA __global__ kernel using two different approaches. The first is using the cuRAND device API, and the second is using thrust. For each approach I have created a different class.

Here is my cuRAND solution:

template<typename T>
struct RNG1
{
    __device__
    RNG1(unsigned int tid) {
        curand_init(tid, tid, 0, &state);
    }

    __device__ T
    operator ()(void) {
        return curand_uniform(&state);
    }

    curandState state;
};

And here is my thrust solution:

template<typename T>
struct RNG2
{
    __device__
    RNG2(unsigned int tid)
        : gen(tid)
        , dis(0, 1) { gen.discard(tid); }

    __device__ T
    operator ()(void) {
        return dis(gen);
    }

    thrust::default_random_engine gen;
    thrust::uniform_real_distribution<T> dis;
};

The way I use them is the following:

template<typename T> __global__ void
mykernel(/* args here */)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    RNG1<T> rng(tid);
    // or
    RNG2<T> rng(tid);

    T a_random_number = rng();  

    // do stuff here
}

Both of them work but the cuRAND solution is much slower (more than 3 times slower). If I set the second parameter of curand_init (sequence number) to 0, then the performance is the same as that of the thrust solution, but the random numbers are "bad". I can see patterns and artefacts in the resulting distribution.

Here are my two questions:

  • Can someone explain to me why the cuRAND solution with a non-zero sequence is slower?
  • How can thrust be as fast as cuRAND with zero sequence, but also generate good random numbers?
  • While searching on Google, I noticed that most people use cuRAND, and very few use thrust to generate random numbers inside device code. Is there something I should be aware of? Am I misusing thrust?

Thank you.


Solution

  • Perhaps the performance difference happens because cuRAND and Thrust use different PRNG algorithms with different performance profiles and demands on memory. Note that cuRAND supports five different PRNG algorithms, and your code doesn't give which one is in use.

    Thrust's default_random_engine is currently minstd_rand, but its documentation notes that this "may change in a future version". (A comment written after I wrote mine also noted that it's minstd_rand.) minstd_rand is a simple linear congruential generator that may be faster than whatever PRNG cuRAND is using.

    This was a comment converted to an answer and edited.