Consider these 3 trivial, minimal kernels. Their register usage is much higher than I expect. Why?
A:
__global__ void Kernel_A()
{
//empty
}
corresponding ptx:
ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20'
ptxas info : Function properties for _Z8Kernel_Av
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
B:
template<uchar effective_bank_width>
__global__ void Kernel_B()
{
//empty
}
template
__global__ void Kernel_B<1>();
corresponding ptx:
ptxas info : Compiling entry function '_Z8Kernel_BILh1EEvv' for 'sm_20'
ptxas info : Function properties for _Z8Kernel_BILh1EEvv
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
C:
template<uchar my_val>
__global__ void Kernel_C
(uchar *const device_prt_in,
uchar *const device_prt_out)
{
//empty
}
corresponding ptx:
ptxas info : Compiling entry function '_Z35 Kernel_CILh1EEvPhS0_' for 'sm_20'
ptxas info : Function properties for _Z35 Kernel_CILh1EEvPhS0_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 10 registers, 48 bytes cmem[0]
Question:
Why did empty kernels A and B use 2 registers? CUDA always uses one implicit register, but why are 2 additional explicit registers used?
Kernel C is even more frustrating. 10 registers? But there are only 2 pointers. This gives 2*2 = 4 registers for the pointers. Even if there are additionally 2 mysterious registers (suggested by Kernel A and Kernel B), this would give 6 total. Still much less than 10 !
In case you are interested, here is the ptx
code for Kernel A. The ptx
code for Kernel B is exactly the same, modulo the integer values and variable names.
.visible .entry _Z8Kernel_Av(
)
{
.loc 5 19 1
func_begin0:
.loc 5 19 0
.loc 5 19 1
func_exec_begin0:
.loc 5 22 2
ret;
tmp0:
func_end0:
}
And for Kernel C...
.weak .entry _Z35Kernel_CILh1EEvPhS0_(
.param .u64 _Z35Kernel_CILh1EEvPhS0__param_0,
.param .u64 _Z35Kernel_CILh1EEvPhS0__param_1
)
{
.local .align 8 .b8 __local_depot2[16];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .s64 %rd<3>;
.loc 5 38 1
func_begin2:
.loc 5 38 0
.loc 5 38 1
mov.u64 %SPL, __local_depot2;
cvta.local.u64 %SP, %SPL;
ld.param.u64 %rd1, [_Z35Kernel_CILh1EEvPhS0__param_0];
ld.param.u64 %rd2, [_Z35Kernel_CILh1EEvPhS0__param_1];
st.u64 [%SP+0], %rd1;
st.u64 [%SP+8], %rd2;
func_exec_begin2:
.loc 5 836 2
tmp2:
ret;
tmp3:
func_end2:
}
.local
) ?.reg .b64
lines. But what is the .reg .s64
line? Why is it there?It gets worse still:
D:
template<uchar my_val>
__global__ void Kernel_D
(uchar * device_prt_in,
uchar *const device_prt_out)
{
device_prt_in = device_prt_in + blockIdx.x*blockDim.x + threadIdx.x;
}
gives
ptxas info : Used 6 registers, 48 bytes cmem[0]
So manipulating the argument (a pointer) decreases from 10 to 6 registers?
The first point to make is that if you are worried about registers, don't look at PTX code, because it won't tell you anything. PTX uses static single assignment form and the code emitted by the compiler doesn't include any of the "decoration" required to make a runnable machine code entry point.
With that out of the way, let's look at kernel A:
$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20'
ptxas info : Function properties for _Z8Kernel_Av
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
$ cuobjdump -sass null.cubin
code for sm_20
Function : _Z8Kernel_Av
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
.............................
There are your two registers. Empty kernels don't produce zero instructions.
Beyond that, I can't reproduce what you have shown. If I look at your kernel C as posted, I get this (CUDA 5 release compiler):
$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z8Kernel_CILh1EEvPhS0_' for 'sm_20'
ptxas info : Function properties for _Z8Kernel_CILh1EEvPhS0_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 48 bytes cmem[0]
$ cuobjdump -sass null.cubin
code for sm_20
Function : _Z8Kernel_CILh1EEvPhS0_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
........................................
ie. identical 2 register code to the first two kernels.
and the same for Kernel D:
$ nvcc -arch=sm_20 -m64 -cubin -Xptxas=-v null.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z8Kernel_DILh1EEvPhS0_' for 'sm_20'
ptxas info : Function properties for _Z8Kernel_DILh1EEvPhS0_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 48 bytes cmem[0]
$ cuobjdump -sass null.cubin
code for sm_20
Function : _Z8Kernel_DILh1EEvPhS0_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
........................................
Again, 2 registers.
For the record, the nvcc version I am using is:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_28_16:10:16_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221