I'm running a Thrust-parallelized binary-search-type routine on an array:
// array and array2 are raw pointers to device memory
thrust::device_ptr<int> array_ptr(array);
// Search for first position where 0 could be inserted in array
// without violating the ordering
thrust::device_vector<int>::iterator iter;
iter = thrust::lower_bound(array_ptr, array_ptr+length, 0, cmp(array2));
A custom function object cmp defines a custom comparison operator:
struct cmp
{
cmp(int *array2){ this->array2 = array2; }
__device__ bool operator()(const int& x, const int& y)
{
return device_function(array2,x) <= device_function(array2,y);
}
int *array2;
};
The comparison relies on a call to a function compiled on the device:
__device__ int device_function( const int* array2, const int value ){
int quantity = 0;
for (int i = 0; i < 50000; ++i){
if ( array2[i] > value ){ quantity += array2[i]; }
}
return quantity;
}
My question is: what (if any) parallel execution is done on the device for the sum-reduction in device_function? If the function executes serially as such, how can I introduce parallelism to speed up the function evaluation?
My question is: what (if any) parallel execution is done on the device for the sum-reduction in device_function?
None. Ordinary C/C++ code in a __device__ function (whether in CUDA or Thrust) executes sequentially, from the context of a single CUDA thread.
If the function executes serially as such, how can I introduce parallelism to speed up the function evaluation?
One possible approach is to use Thrust v1.8 (available from github or from CUDA 7 RC) and place an appropriate thrust function in the functor (cmp) that you pass to thrust::lower_bound.
Here is a worked example of using thrust::sort from within a custom functor passed to another thrust function.
Parallelization using this method requires compilation for, and execution on, a device that supports CUDA Dynamic Parallelism. And there is no guarantee of overall speed-up, just as with any CUDA Dynamic Parallelism code. Whether this additional level of parallelism provides any benefit depends on a number of factors, such as whether the previous level of parallelism was already fully utilizing the device.
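For reference, a dynamic-parallelism build needs relocatable device code and the device runtime library, and a target architecture of at least sm_35. A typical compile line (the source file name search.cu is just illustrative) would look something like:

```shell
nvcc -arch=sm_35 -rdc=true search.cu -o search -lcudadevrt
```

Without -rdc=true and -lcudadevrt, the in-kernel thrust call will fail to compile or link.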
For example purposes, it appears that the function contained in your device_function could be replaced by a single call to thrust::transform_reduce. Your cmp functor could then be rewritten as something like this (coded in browser, not tested):
struct cmp
{
cmp(int *array2){ this->array2 = array2; }
__device__ bool operator()(const int& x, const int& y)
{
    return (thrust::transform_reduce(thrust::device,
                                     array2, array2 + 50000,
                                     my_greater_op(x),
                                     0, thrust::plus<int>())
         <= thrust::transform_reduce(thrust::device,
                                     array2, array2 + 50000,
                                     my_greater_op(y),
                                     0, thrust::plus<int>()));
}
int *array2;
};
and you would have to provide an appropriate my_greater_op functor:
struct my_greater_op
{
int val;
__host__ __device__ my_greater_op(int _val) {val = _val;}
__host__ __device__ int operator()(const int& x)
{
return (x>val)?x:0;
}
};