I am profiling a very dumb sorting algorithm for small input data (= 512 elements). I am invoking a kernel that performs coalesced reads from an array of structs.
The struct looks like this:
struct __align__(8) Elements
{
    float weight;
    int value;
};
nvprof delivers the following instruction counts for L1 misses/hits and gld instructions:
Invocations Avg Min Max Event Name
Kernel: sort(Elements*)
500 0 0 0 gld_inst_8bit
500 0 0 0 gld_inst_16bit
500 1024 1024 1024 gld_inst_32bit
500 0 0 0 gld_inst_64bit
500 0 0 0 gld_inst_128bit
500 120 120 120 l1_global_load_hit
500 120 120 120 l1_global_load_miss
500 0 0 0 uncached_global_load_tr.
If I change the layout of the struct as follows:
struct __align__(8) Elements
{
    float weight;
    float value;
};
The profiling output looks like this:
Invocations Avg Min Max Event Name
Device 0
Kernel: sort(Elements*)
500 0 0 0 gld_inst_8bit
500 0 0 0 gld_inst_16bit
500 0 0 0 gld_inst_32bit
500 512 512 512 gld_inst_64bit
500 0 0 0 gld_inst_128bit
500 0 0 0 l1_global_load_hit
500 120 120 120 l1_global_load_miss
500 0 0 0 uncached_global_load_tr.
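(For reference, counters like these can be collected with nvprof's event collection mode, along the lines of nvprof --events gld_inst_32bit,gld_inst_64bit,l1_global_load_hit,l1_global_load_miss ./sort — the executable name here is just a placeholder.)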
There is no impact on the execution time at all, but I don't understand why the GPU performs 32 bit load instructions for the first variant of the code and 64 bit load instructions for the second.
The kernel is invoked with 1 block and 512 threads (so the l1_global_load_x counters may be incorrect). All of this takes place on a GeForce 480 with CUDA 5.0.
EDIT: The sort kernel (a little shortened):
__global__ void sort(Elements* nearest)
{
    ThreadIndex idx = index();

    __shared__ Elements temp[MAX_ELEMENTS];
    __shared__ int index_cache[MAX_ELEMENTS];

    // Stage the global data in shared memory.
    temp[idx.x] = nearest[idx.x];
    Elements elem = temp[idx.x];
    __syncthreads();

    int c = 0;
    // some index crunching
    nearest[idx.x] = temp[c];
}
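For reference, the helpers used above are not shown; a minimal sketch of what the shortened kernel assumes (these definitions are placeholders, not the original code) would be:

// Placeholder definitions so the shortened kernel above is self-contained.
// MAX_ELEMENTS and the exact index() logic are assumptions, not the originals.
#define MAX_ELEMENTS 512

struct ThreadIndex
{
    int x;
};

__device__ ThreadIndex index()
{
    ThreadIndex t;
    t.x = threadIdx.x + blockIdx.x * blockDim.x;
    return t;
}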
The basic reason for this comes down to code generation by the compiler. The PTX assembler has different virtual register state spaces for floating point and integer values, and it (I think) isn't possible to perform a 64 bit load into two registers in different state spaces. So the compiler emits two 32 bit loads for the mixed int/float struct, but can emit a single 64 bit vector load into two registers for the float/float struct.
This can be illustrated by considering the following model of your code:
struct __align__(8) ElementsB
{
    float weight;
    float value;
};

struct __align__(8) ElementsA
{
    float weight;
    int value;
};
template<typename T>
__global__ void kernel(const T* __restrict__ in, T* __restrict__ out, bool flag)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    T ival = in[idx];
    if (flag) {
        out[idx] = ival;
    }
}

template __global__ void kernel<ElementsA>(const ElementsA *, ElementsA *, bool);
template __global__ void kernel<ElementsB>(const ElementsB *, ElementsB *, bool);
Here we have the two structures you mentioned, and a simple templated kernel instantiated for both types. If we look at the PTX emitted by the compiler for sm_20 (CUDA 5.0 release compiler), the differences are obvious. For the ElementsA instance:
ld.param.u32 %r4, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_0];
ld.param.u32 %r5, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_1];
ld.param.u8 %rc1, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_2];
cvta.to.global.u32 %r1, %r5;
cvta.to.global.u32 %r6, %r4;
.loc 2 16 1
mov.u32 %r7, %ntid.x;
mov.u32 %r8, %ctaid.x;
mov.u32 %r9, %tid.x;
mad.lo.s32 %r2, %r7, %r8, %r9;
.loc 2 18 1
shl.b32 %r10, %r2, 3;
add.s32 %r11, %r6, %r10;
ld.global.u32 %r3, [%r11+4]; // 32 bit integer load
ld.global.f32 %f1, [%r11]; // 32 bit floating point load
(comments added for emphasis)
and for the ElementsB instance:
ld.param.u32 %r3, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_0];
ld.param.u32 %r4, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_1];
ld.param.u8 %rc1, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_2];
cvta.to.global.u32 %r1, %r4;
cvta.to.global.u32 %r5, %r3;
.loc 2 16 1
mov.u32 %r6, %ntid.x;
mov.u32 %r7, %ctaid.x;
mov.u32 %r8, %tid.x;
mad.lo.s32 %r2, %r6, %r7, %r8;
.loc 2 18 1
shl.b32 %r9, %r2, 3;
add.s32 %r10, %r5, %r9;
ld.global.v2.f32 {%f9, %f10}, [%r10]; // 64 bit float2 load
The reason there is no performance penalty between the two is that the underlying hardware uses 128 byte fetches for coalesced warp level loads: each warp reads 32 x 8 = 256 contiguous bytes, so in both cases the transactions resolve to the same pair of 128 byte fetches per warp, whether each thread issues two 32 bit loads or one 64 bit load.
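As a side note, if you did want the mixed int/float struct to be fetched with a single 64 bit load per thread, one option (not from the original code, and whether the compiler actually emits the vector load can vary between toolkit versions) is to read the 8 bytes through a 64 bit proxy type such as int2 and reinterpret the words yourself. A minimal sketch, reusing ElementsA from above (the kernel name is just illustrative):

// Hypothetical variant: read ElementsA{float, int} through an int2 so the
// compiler sees a single 64 bit load, then reinterpret the two 32 bit words.
__global__ void copyAsInt2(const ElementsA* __restrict__ in,
                           ElementsA* __restrict__ out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // One 64 bit load covering both struct members.
    int2 raw = reinterpret_cast<const int2*>(in)[idx];

    ElementsA e;
    e.weight = __int_as_float(raw.x); // first word holds the float member
    e.value  = raw.y;                 // second word holds the int member
    out[idx] = e;
}

As noted above, though, this makes no measurable difference here, because the warp level memory transactions are identical either way.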