I am profiling a very dumb sorting algorithm for small input data (= 512 elements). I am invoking a kernel that performs coalesced reads from an array of structs.
The struct looks like this:
struct __align__(8) Elements
{
    float weight;
    int value;
};
nvprof delivers the following instruction counts for L1 misses/hits and gld instructions:
Invocations Avg Min Max Event Name
Kernel: sort(Elements*)
500 0 0 0 gld_inst_8bit
500 0 0 0 gld_inst_16bit
500 1024 1024 1024 gld_inst_32bit
500 0 0 0 gld_inst_64bit
500 0 0 0 gld_inst_128bit
500 120 120 120 l1_global_load_hit
500 120 120 120 l1_global_load_miss
500 0 0 0 uncached_global_load_tr.
If I change the layout of the struct as follows:
struct __align__(8) Elements
{
    float weight;
    float value;
};
The profiling output looks like this:
Invocations Avg Min Max Event Name
Device 0
Kernel: sort(Elements*)
500 0 0 0 gld_inst_8bit
500 0 0 0 gld_inst_16bit
500 0 0 0 gld_inst_32bit
500 512 512 512 gld_inst_64bit
500 0 0 0 gld_inst_128bit
500 0 0 0 l1_global_load_hit
500 120 120 120 l1_global_load_miss
500 0 0 0 uncached_global_load_tr.
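(For reference, counters like these can be collected with nvprof's event collection mode, along the lines of nvprof --events gld_inst_32bit,gld_inst_64bit,l1_global_load_hit,l1_global_load_miss ./sort — the executable name here is just a placeholder.)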
There is no impact on the execution time at all, but I don't understand why the GPU performs 32 bit load instructions for the first variant of the code and 64 bit load instructions for the second.
The kernel is invoked with 1 block and 512 threads (so the l1_global_load_x counters may be incorrect). All of this takes place on a GeForce 480 with CUDA 5.0.
EDIT: The sort kernel (a little shortened):
__global__ void sort(Elements* nearest)
{
    ThreadIndex idx = index();

    __shared__ Elements temp[MAX_ELEMENTS];
    __shared__ int index_cache[MAX_ELEMENTS];

    // Stage the global data in shared memory.
    temp[idx.x] = nearest[idx.x];
    Elements elem = temp[idx.x];
    __syncthreads();

    int c = 0;
    // some index crunching
    nearest[idx.x] = temp[c];
}
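For reference, the helpers used above are not shown; a minimal sketch of what the shortened kernel assumes (these definitions are placeholders, not the original code) would be:

// Placeholder definitions so the shortened kernel above is self-contained.
// MAX_ELEMENTS and the exact index() logic are assumptions, not the originals.
#define MAX_ELEMENTS 512

struct ThreadIndex
{
    int x;
};

__device__ ThreadIndex index()
{
    ThreadIndex t;
    t.x = threadIdx.x + blockIdx.x * blockDim.x;
    return t;
}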
The basic reason for this comes down to code generation by the compiler. The PTX assembler has different virtual register state spaces for floating point and integer values, and it (I think) isn't possible to perform a 64 bit load into two registers in different state spaces. So the compiler emits two 32 bit loads for the mixed int/float struct, but can emit a single 64 bit vector load into two registers for the float/float struct.
This can be illustrated by considering the following model of your code:
struct __align__(8) ElementsB
{
    float weight;
    float value;
};

struct __align__(8) ElementsA
{
    float weight;
    int value;
};
template<typename T>
__global__ void kernel(const T* __restrict__ in, T* __restrict__ out, bool flag)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    T ival = in[idx];
    if (flag) {
        out[idx] = ival;
    }
}

template __global__ void kernel<ElementsA>(const ElementsA *, ElementsA *, bool);
template __global__ void kernel<ElementsB>(const ElementsB *, ElementsB *, bool);
Here we have the two structures you mentioned, and a simple templated kernel instantiated for both types. If we look at the PTX emitted by the compiler for sm_20 (CUDA 5.0 release compiler), the differences are obvious. For the ElementsA instance:
ld.param.u32 %r4, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_0];
ld.param.u32 %r5, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_1];
ld.param.u8 %rc1, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_2];
cvta.to.global.u32 %r1, %r5;
cvta.to.global.u32 %r6, %r4;
.loc 2 16 1
mov.u32 %r7, %ntid.x;
mov.u32 %r8, %ctaid.x;
mov.u32 %r9, %tid.x;
mad.lo.s32 %r2, %r7, %r8, %r9;
.loc 2 18 1
shl.b32 %r10, %r2, 3;
add.s32 %r11, %r6, %r10;
ld.global.u32 %r3, [%r11+4]; // 32 bit integer load
ld.global.f32 %f1, [%r11]; // 32 bit floating point load
(comments added for emphasis)
and for the ElementsB instance:
ld.param.u32 %r3, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_0];
ld.param.u32 %r4, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_1];
ld.param.u8 %rc1, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_2];
cvta.to.global.u32 %r1, %r4;
cvta.to.global.u32 %r5, %r3;
.loc 2 16 1
mov.u32 %r6, %ntid.x;
mov.u32 %r7, %ctaid.x;
mov.u32 %r8, %tid.x;
mad.lo.s32 %r2, %r6, %r7, %r8;
.loc 2 18 1
shl.b32 %r9, %r2, 3;
add.s32 %r10, %r5, %r9;
ld.global.v2.f32 {%f9, %f10}, [%r10]; // 64 bit float2 load
The reason there is no performance penalty between the two is that the underlying hardware uses 128 byte fetches for coalesced warp level loads: each warp reads 32 x 8 = 256 contiguous bytes, so in both cases the transactions resolve to the same pair of 128 byte fetches per warp, whether each thread issues two 32 bit loads or one 64 bit load.
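As a side note, if you did want the mixed int/float struct to be fetched with a single 64 bit load per thread, one option (not from the original code, and whether the compiler actually emits the vector load can vary between toolkit versions) is to read the 8 bytes through a 64 bit proxy type such as int2 and reinterpret the words yourself. A minimal sketch, reusing ElementsA from above (the kernel name is just illustrative):

// Hypothetical variant: read ElementsA{float, int} through an int2 so the
// compiler sees a single 64 bit load, then reinterpret the two 32 bit words.
__global__ void copyAsInt2(const ElementsA* __restrict__ in,
                           ElementsA* __restrict__ out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // One 64 bit load covering both struct members.
    int2 raw = reinterpret_cast<const int2*>(in)[idx];

    ElementsA e;
    e.weight = __int_as_float(raw.x); // first word holds the float member
    e.value  = raw.y;                 // second word holds the int member
    out[idx] = e;
}

As noted above, though, this makes no measurable difference here, because the warp level memory transactions are identical either way.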