In my unary_op.operator(), I need to create a temporary array. I guess cudaMalloc is the way to go. But is it efficient performance-wise, or is there a better design?
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>

struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int* array;
        cudaMalloc((void**)&array, 10 * sizeof(int));
        for (int i = 0; i < 10; i++)
            array[i] = index;
        int sum = 0;
        for (int i = 0; i < 10; i++)
            sum += array[i];
        return sum;
    }
};
int main()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 100;
    my_unary_op unary_op;
    thrust::plus<int> binary_op;
    int init = 0;
    int sum = thrust::transform_reduce(first, last, unary_op, init, binary_op);
    return 0;
}
You won't be able to compile cudaMalloc() in a __device__ function, because it is a host-only function. You can, however, use plain malloc() or new (on devices of compute capability >= 2.0), but these are not very efficient when running on the device. There are two reasons for this. The first is that concurrently running threads are serialized during the memory allocation call. The second is that the calls allocate global memory in chunks arranged in such a way that, when the memory load and store instructions are executed by the 32 threads in a warp, the accessed locations are not adjacent, so you don't get properly coalesced memory transactions.
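For reference, here is a minimal sketch of what device-side allocation looks like (the kernel name and sizes are illustrative, not from the question); it compiles for compute capability >= 2.0, but it exhibits exactly the serialization and coalescing problems described above:

// Sketch: per-thread device-side malloc/free (compute capability >= 2.0).
// The device heap allocator serializes concurrent callers, and the
// resulting per-thread buffers are not laid out for coalesced warp access.
__global__ void device_malloc_demo(int* out)   // hypothetical kernel
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int* scratch = (int*)malloc(10 * sizeof(int));
    if (!scratch)            // device malloc can fail if the device heap is exhausted
        return;
    int sum = 0;
    for (int i = 0; i < 10; i++)
    {
        scratch[i] = idx;
        sum += scratch[i];
    }
    out[idx] = sum;
    free(scratch);           // must be freed on the device, too
}
// Launched e.g. as: device_malloc_demo<<<num_blocks, 256>>>(d_out);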
You can address both of these issues by using fixed-size C-style arrays in your __device__ functions (i.e., int array[10];). Small, fixed-size arrays can sometimes be optimized by the compiler so that they are stored in the register file, for extremely fast access. If the compiler cannot keep them in registers, it stores them in local memory. Local memory is backed by global memory, but it is interleaved in such a way that when the 32 threads in a warp run a load or store instruction, each thread accesses adjacent locations, enabling the transactions to be fully coalesced.
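Applied to the functor from the question, that change is just a drop-in replacement for the cudaMalloc() call (a sketch, not benchmarked):

struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int array[10];   // fixed-size local array: no allocator call; may be
                         // promoted to registers or held in coalesced local memory
        for (int i = 0; i < 10; i++)
            array[i] = index;
        int sum = 0;
        for (int i = 0; i < 10; i++)
            sum += array[i];
        return sum;
    }
};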
If you don't know the size of your C arrays at compile time, allocate them with a maximum size and leave part of each array unused.
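For example, something like this, where MAX_N and the helper name are hypothetical (a sketch, assuming the logical size n is only known at runtime but has a known upper bound):

#define MAX_N 32   // hypothetical compile-time upper bound

__host__ __device__ int fill_and_sum(int index, int n)   // hypothetical helper
{
    int array[MAX_N];   // always MAX_N slots; only the first n are used
    int sum = 0;
    for (int i = 0; i < n && i < MAX_N; i++)
    {
        array[i] = index;
        sum += array[i];
    }
    return sum;
}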
I think that the total amount of memory used by the fixed-size arrays will depend on the number of threads that are resident concurrently on the GPU, not on the total number of threads launched by the kernel. In this answer @mharris shows how to calculate the maximum possible number of concurrent threads, which is 24,576 for a GTX 580. So, if each fixed-size array holds 16 32-bit values, the maximum possible amount of memory used by the arrays would be 1536 KiB.
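If you want to estimate this for your own device at runtime, you can query the device properties with the standard CUDA runtime API (a sketch; error checking omitted, and the 16 ints per thread is just the example figure from above):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Upper bound on threads that can be resident on the GPU at once.
    int max_concurrent = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    // Worst-case footprint of a 16-element int array per concurrent thread.
    size_t bytes = (size_t)max_concurrent * 16 * sizeof(int);
    printf("%d concurrent threads -> %zu KiB of local array storage\n",
           max_concurrent, bytes / 1024);
    return 0;
}

On a GTX 580 (16 SMs x 1536 threads per SM) this reproduces the 24,576-thread and 1536 KiB figures above.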
If you need a wide range of array sizes, you can use templates to compile kernels with a number of different sizes and then, at runtime, select the one that accommodates the size you need. However, chances are that if you simply allocate the maximum of what you might need, memory usage will not be the limiting factor in the number of threads you can launch.
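A sketch of that template approach, building on the question's Thrust code (my_sized_op, the dispatch function, and the size breakpoints are all hypothetical names and choices):

#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>

template <int N>
struct my_sized_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int array[N];   // size fixed at compile time, per instantiation
        int sum = 0;
        for (int i = 0; i < N; i++)
        {
            array[i] = index;
            sum += array[i];
        }
        return sum;
    }
};

// At runtime, pick the smallest instantiation that fits the needed size.
int run_transform_reduce(thrust::counting_iterator<int> first,
                         thrust::counting_iterator<int> last,
                         int needed_size)
{
    if (needed_size <= 8)
        return thrust::transform_reduce(first, last, my_sized_op<8>(),  0, thrust::plus<int>());
    if (needed_size <= 16)
        return thrust::transform_reduce(first, last, my_sized_op<16>(), 0, thrust::plus<int>());
    return thrust::transform_reduce(first, last, my_sized_op<32>(),     0, thrust::plus<int>());
}

Each instantiation compiles to its own kernel, so you pay in binary size and compile time for every size you support; a handful of power-of-two breakpoints is usually a reasonable compromise.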