The PTX version of my kernel looks like this:
.version 2.2
.target sm_20, texmode_independent
.entry histogram(
.param .u32 .ptr .global .align 4 histogram_param_0,
.param .u32 .ptr .global .align 4 histogram_param_1
)
{
.reg .f32 %f<2>;
.reg .s32 %r<12>;
_histogram:
mov.u32 %r1, %tid.x;
mov.u32 %r2, %envreg3;
add.s32 %r3, %r1, %r2;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %ntid.x;
mad.lo.s32 %r6, %r4, %r5, %r3;
shl.b32 %r7, %r6, 2;
ld.param.u32 %r8, [histogram_param_0];
add.s32 %r9, %r8, %r7;
ld.param.u32 %r10, [histogram_param_1];
ld.global.f32 %f1, [%r9];
add.s32 %r11, %r10, %r7;
st.global.f32 [%r11], %f1;
ret;
}
As I counted, there are only 13 instructions in my kernel (not including the ret instruction). I set the number of work items to 5120 and the workgroup size to 64. Since there are 16 SMs, each with 32 scalar processors, the above code will be executed 10 times per SM, so I expected the number of executed instructions to be 10*13 = 130. But after profiling, the results are: issued instructions=130, executed instructions=100.
1. Why is the number of issued instructions different from the number of executed instructions? There are no branches, so aren't they supposed to be equal?
2. Why is the number of executed instructions smaller than expected? Shouldn't every instruction in the PTX version be executed at least once?
3. Do cache misses (L1 and L2) have any impact on the number of issued instructions and the number of executed instructions?
Thanks
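For reference, the estimate above can be written out as a quick check. This is only a sketch of the reasoning in the question, not a model of how the hardware schedules work: the variable names are illustrative, and it assumes the work is spread evenly across the SMs and that the profiler counters are reported per SM.

```python
# Figures quoted in the question
work_items = 5120
sms = 16
scalar_processors_per_sm = 32
ptx_instructions = 13  # counted from the PTX listing, excluding ret

# Assumption from the question: each SM runs the kernel body once
# per batch of 32 work items assigned to it
passes_per_sm = work_items // (sms * scalar_processors_per_sm)
expected_per_sm = passes_per_sm * ptx_instructions

print(passes_per_sm)    # 10
print(expected_per_sm)  # 130
```

The result matches the issued count reported by the profiler (130), but not the executed count (100).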
PTX is only an intermediate representation of the compiled code; it is not what the GPU actually executes. A further assembly step emits the machine code that the GPU runs. This can happen either at compile time or through JIT compilation in the driver. As a result, instruction counts taken from the PTX, and anything you infer from them, are invalid.
NVIDIA ships a tool called cuobjdump, which can disassemble the assembler output generated for Fermi cards and show the actual machine code run on the GPU.