(There are similar questions about device (global) memory arrays, e.g. my own question.)
Suppose I have CUDA kernel code that goes something like this:
int my_arr[MyCompileTimeConstant];
/* ... */
for(unsigned i = 0; i < foo(); i++) {
    my_arr[bar(i)] += some_value;
}
Now, I want to initialize my_arr
to all-zeros before I start adding to its entries. Can I do any better than the trivial loop
for(unsigned i = 0; i < MyCompileTimeConstant; i++) {
    my_arr[i] = 0;
}
?
Note: I specifically made the loop range and the array size constants known at compile time. The question would have been slightly different had they been passed at run-time. Of course, that may not change the answer for CUDA the way it does for code running on the CPU.
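For concreteness, the kind of alternatives I have in mind are things like aggregate initialization or a device-side memset, sketched below for an int array; I don't know whether either is actually any better:

int my_arr[MyCompileTimeConstant] = {0};                   // aggregate initialization
// or
memset(my_arr, 0, MyCompileTimeConstant * sizeof(int));    // device-side memset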
A simple loop should be the "best" approach (but see the final comment below). Take the following kernel as an example:
template<int version>
__global__
void tkernel(int *A, int *B, int *C, int n)
{
    int biglocal[100];

    switch(version) {
    case 1:
        // Version 1: plain element-wise loop
        for(int i=0; i<100; i++) {
            biglocal[i] = 0;
        }
        break;
    case 2:
        // Version 2: byte-wise memset
        memset(&biglocal[0], 0, 100*sizeof(int));
        break;
    case 3:
        // Version 3: "vectorised" zeroing via 128-bit int4 stores
        {
            const int4 zero = {0, 0, 0, 0};
            int4 *p = reinterpret_cast<int4*>(&biglocal[0]);
#pragma unroll
            for(int i=0; i<100/4; i++) {
                p[i] = zero;
            }
        }
        break;
    }

    // Use the array so the initialisation cannot be optimised away
    if (n>0) {
        for(int i=0; i<100; i++) {
            biglocal[A[threadIdx.x*i]] += B[threadIdx.x*i];
        }
        C[threadIdx.x] = biglocal[n];
    }
}

template __global__ void tkernel<1>(int *, int *, int *, int);
template __global__ void tkernel<2>(int *, int *, int *, int);
template __global__ void tkernel<3>(int *, int *, int *, int);
Here we have three different ways to zero a large local memory array, plus some code to convince the compiler that the whole initialisation sequence and local array shouldn't be optimised away.
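For reference, PTX like the listings that follow can be dumped with something along these lines (the file name tkernel.cu is just an assumption):

nvcc -ptx -arch=sm_21 tkernel.cu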
Looking at the PTX emitted for compute 2.1 targets with the CUDA 6 release compiler, both versions 1 & 3 look like this:
.local .align 4 .b8 __local_depot0[400];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<3>;
.reg .s32 %r<67>;
.reg .s64 %rd<73>;
mov.u64 %SPL, __local_depot0;
ld.param.u64 %rd4, [_Z7tkernelILi1EEvPiS0_S0_i_param_0];
ld.param.u64 %rd5, [_Z7tkernelILi1EEvPiS0_S0_i_param_1];
ld.param.u64 %rd6, [_Z7tkernelILi1EEvPiS0_S0_i_param_2];
ld.param.u32 %r21, [_Z7tkernelILi1EEvPiS0_S0_i_param_3];
add.u64 %rd7, %SPL, 0;
mov.u32 %r66, 0;
st.local.u32 [%rd7], %r66;
st.local.u32 [%rd7+4], %r66;
st.local.u32 [%rd7+8], %r66;
st.local.u32 [%rd7+12], %r66;
st.local.u32 [%rd7+16], %r66;
st.local.u32 [%rd7+20], %r66;
// etc
i.e. the compiler unrolled the loop and emitted a string of 32-bit store instructions. The int4
trick in version 3 produced code identical to the simple loop, which is a little surprising. Version 2, however, gets this:
.local .align 4 .b8 __local_depot1[400];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<4>;
.reg .s16 %rs<2>;
.reg .s32 %r<66>;
.reg .s64 %rd<79>;
mov.u64 %SPL, __local_depot1;
ld.param.u64 %rd7, [_Z7tkernelILi2EEvPiS0_S0_i_param_0];
ld.param.u64 %rd8, [_Z7tkernelILi2EEvPiS0_S0_i_param_1];
ld.param.u64 %rd9, [_Z7tkernelILi2EEvPiS0_S0_i_param_2];
ld.param.u32 %r21, [_Z7tkernelILi2EEvPiS0_S0_i_param_3];
add.u64 %rd11, %SPL, 0;
mov.u64 %rd78, 0;
BB1_1:
add.s64 %rd12, %rd11, %rd78;
mov.u16 %rs1, 0;
st.local.u8 [%rd12], %rs1;
add.s64 %rd78, %rd78, 1;
setp.lt.u64 %p1, %rd78, 400;
@%p1 bra BB1_1;
i.e. a loop performing 8-bit writes (comments indicate that simple list initialisation will also yield this type of copy loop).

The latter will be a lot slower than the former. Apart from the difference in store width, the unrolled stream of writes is fully independent and can be issued in whatever order keeps the instruction pipeline full, which should lead to higher instruction throughput. I don't believe it is possible to beat the compiler in the unrolled case, and a simple loop appears to yield the same code as a naive attempt at vectorization. If you were really keen, I guess you could try inline PTX to generate wider stores, though I don't know whether there would be any performance advantage in doing so.
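For the curious, here is a rough, untested sketch of what that inline PTX approach might look like. Everything below (the kernel name tkernel_ptx, the 16-byte alignment, the loop structure) is my own invention rather than anything the compiler emitted, and I have no idea whether it would actually beat the unrolled 32-bit stores:

__global__ void tkernel_ptx(int *A, int *B, int *C, int n)
{
    // Force 16-byte alignment so the 128-bit vector stores below are legal
    __align__(16) int biglocal[100];

    // Convert the generic address of the array into a local-window address
    unsigned long long lp;
    asm volatile("cvta.to.local.u64 %0, %1;" : "=l"(lp) : "l"(biglocal));

#pragma unroll
    for (int i = 0; i < 100; i += 4) {
        // One st.local.v4.u32 writes four zeroed 32-bit words at once
        asm volatile("st.local.v4.u32 [%0], {%1, %1, %1, %1};"
                     :: "l"(lp + i * sizeof(int)), "r"(0) : "memory");
    }

    // Same consumer code as before, so the initialisation isn't optimised away
    if (n > 0) {
        for (int i = 0; i < 100; i++) {
            biglocal[A[threadIdx.x * i]] += B[threadIdx.x * i];
        }
        C[threadIdx.x] = biglocal[n];
    }
}

Note the __align__(16) on the array: without it, the 128-bit vector stores would not be guaranteed to hit 16-byte aligned addresses.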