So I'm trying to make use of this custom RNG library for OpenCL: http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
The library defines a state struct:
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
And in order to generate a random uint, you pass the state into the following function:
uint MWC64X_NextUint(mwc64x_state_t *s)
which updates the state, so that when you pass it into the function again, the next "random" number in the sequence will be generated.
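So, as far as I can tell, the intended usage inside a kernel looks something like this (my own sketch, not code taken from the library's examples):
mwc64x_state_t rng;
// ...the state has to be initialised/seeded somehow first...
uint a = MWC64X_NextUint(&rng);  // first value of this work-item's sequence
uint b = MWC64X_NextUint(&rng);  // the state was advanced, so this is the next value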
For the project I'm creating, I need to be able to generate random numbers not just across different work groups/items but also across multiple devices simultaneously, and I'm having trouble figuring out the best way to design this. Should I create one mwc64x_state_t object per device/command queue and pass that state in as a global variable? Or is it possible to create one state object shared by all devices at once? Or should I skip the global variable entirely and declare a new state locally within each kernel function?
The library also comes with this function:
void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset)
which is supposed to split the RNG up into multiple "streams", but including it in my kernel makes the kernel incredibly slow. For instance, if I do something very simple like the following:
__kernel void myKernel()
{
    mwc64x_state_t rng;
    MWC64X_SeedStreams(&rng, 0, 10000);
}
Then the kernel call becomes around 40x slower.
The library does come with some source code that serves as example usage, but the examples are fairly limited and don't seem to be that helpful.
So if anyone is familiar with RNGs in OpenCL, or if you've used this particular library before, I'd very much appreciate your advice.
The MWC64X_SeedStreams function is indeed relatively slow, at least in comparison to the MWC64X_NextUint call, but this is true of most parallel RNGs that try to split one large global stream into many sub-streams that can be used in parallel. The assumption is that you'll be calling NextUint many times within the kernel (e.g. a hundred or more), while SeedStreams is called only once at the top of the kernel.
This is an annotated version of the EstimatePi example that comes with the library (mwc64x/test/estimate_pi.cpp and mwc64x/test/test_mwc64x.cl):
__kernel void EstimatePi(ulong n, ulong baseOffset, __global ulong *acc)
{
    // One RNG state per work-item
    mwc64x_state_t rng;
    // This calculates the number of samples that each work-item uses
    ulong samplesPerStream=n/get_global_size(0);
    // And then skip each work-item ahead to its part of the stream, which
    // will run from stream offset:
    //   baseOffset+2*samplesPerStream*get_global_id(0)
    // up to (but not including):
    //   baseOffset+2*samplesPerStream*(get_global_id(0)+1)
    // (the factor of 2 is because each sample consumes two uints, x and y)
    MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
    // Now use the numbers
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong x=MWC64X_NextUint(&rng);
        ulong y=MWC64X_NextUint(&rng);
        ulong x2=x*x;
        ulong y2=y*y;
        // x2+y2 >= x2 holds exactly when the 64-bit addition did not wrap,
        // i.e. x*x+y*y < 2^64, so (x,y) lies inside the quarter circle of
        // radius 2^32 and the sample counts towards the estimate of pi.
        if(x2+y2 >= x2)
            count++;
    }
    acc[get_global_id(0)] = count;
}
So the intent is that n should be large, and should grow as the number of work-items grows, so that samplesPerStream remains around a hundred or more.
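For instance, on the host you would pick a per-work-item sample count and derive n from the global size before launching EstimatePi. Something along these lines (a rough sketch of my own, not the actual estimate_pi.cpp; the launch_estimate_pi helper and its parameter names are made up):
#include <CL/cl.h>

/* Rough host-side sketch (not the library's estimate_pi.cpp): choose n so
 * that each work-item gets a long enough sub-stream to amortise SeedStreams.
 * Assumes queue, kernel (built from the EstimatePi source) and accBuffer
 * have already been created. */
void launch_estimate_pi(cl_command_queue queue, cl_kernel kernel,
                        cl_mem accBuffer, size_t globalSize,
                        cl_ulong samplesPerStream, cl_ulong *baseOffset)
{
    cl_ulong n = (cl_ulong)globalSize * samplesPerStream;

    clSetKernelArg(kernel, 0, sizeof(cl_ulong), &n);
    clSetKernelArg(kernel, 1, sizeof(cl_ulong), baseOffset);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &accBuffer);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                           0, NULL, NULL);

    /* Each sample consumes two uints (x and y), so advance baseOffset past
     * everything this launch used before enqueueing another batch. */
    *baseOffset += 2 * n;
}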
If you want multiple kernels running on multiple devices, then you need to add another level of hierarchy to the stream splitting. So, for example, if you have:
- K devices,
- W work-items per device, and
- C calls to NextUint per work-item,
then you end up with N = K*W*C total calls to NextUint across all work-items. If your devices are identified as k=0..(K-1), then within each kernel you would do:
MWC64X_SeedStreams(&rng, W*C*k, C);
Then the indices within the stream would be:
[0 .. N ) : Parts of stream used across all devices
[k*(W*C) .. (k+1)*(W*C) ) : Used within device k
[k*(W*C)+(i*C) .. k*(W*C)+(i+1)*C ) : Used by work-item i in device k.
It is fine if each work-item uses fewer than C samples; you can over-estimate C if necessary.
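As a concrete sketch of that scheme (this is not code that ships with the library; deviceIndex and callsPerItem are names I've made up for k and C):
// Sketch of the scheme above: k and C arrive as kernel arguments, W is just
// the global size of this kernel instance. Assumes the library's .cl source
// (e.g. mwc64x.cl) is included in the program so the MWC64X functions exist.
__kernel void PerDeviceKernel(ulong deviceIndex,   // k, in 0..K-1
                              ulong callsPerItem,  // C
                              __global uint *out)
{
    mwc64x_state_t rng;
    ulong W = get_global_size(0);

    // Device k owns [k*W*C .. (k+1)*W*C); within that, SeedStreams gives
    // each work-item its own contiguous block of C values.
    MWC64X_SeedStreams(&rng, deviceIndex*W*callsPerItem, callsPerItem);

    uint result = 0;
    for (ulong j = 0; j < callsPerItem; j++) {
        result ^= MWC64X_NextUint(&rng);  // use at most C values per work-item
    }
    out[get_global_id(0)] = result;
}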
(I'm the author of the library).