I've got two ways to allow my thrust functor to access global non-vectorizable non-uniformly-accessed read-only state. Unfortunately there is a 100-fold difference in kernel execution time. Why would there be any difference in my two strategies?
And more generally: Is there a canonical way to provide a thrust functor with access to these kinds of globals?
My first way is to put a copy of my global data into the functor. The thrust machinery appears to perform upload and caching on the device:
// functor containing a copy of array dependency
template<size_t BARSIZE>
struct foo1_func
{
__align__(16) float bar[BARSIZE];
foo1_func(float _bar[BARSIZE]) { memcpy(bar,_bar,BARSIZE*sizeof(float)); }
__host__ __device__ operator()(float &t) { t = do_something(t, bar); }
}
Called using thrust::for_each...
// assuming barData is a float[]
foo<N>(barData);
My second way is to perform the uploading to the device myself using thrust::copy and just pass device-memory pointers of the uploaded data to my functor. This method appears to be much slower:
// functor containing device pointers to array in GMEM
struct foo2_func
{
float *bar;
foo2_func(float* _bar) { bar = bar; }
__host__ __device__ operator()(float &t) { t = do_something(t, bar); }
}
Called using thrust::for_each...
// assuming d_bar is a thrust::device_vector
foo(thrust::raw_pointer_cast(d_bar.data()));
Links to sources that illustrate canonical or unique functor patterns gratefully accepted.
With the first way, you are actually trying to put the whole array bar
to GPU registers by passing the struct foo1_func
as a kernel function parameter.
__global__ void kernel_generated_by_thrust(struct foo_func f, ...) {
float x = f.bar[3];
...
}
If the size of bar
is small enough to be put in the resisters, random access to bar
is actually the random access to register.
But your second way only passed a global memory pointer through the struct. So random access to bar
is the random access to global memory.
That's why the second way is much slower.
Both ways have their use cases. You could choose either one depending on what you want to achieve, the size of your bar
and how much register you want to spend on caching the bar
.