I'm developing a kernel function with several vector operations, like scalar and vector products. The kernel uses a large number of registers, so occupancy is very low. I'm trying to reduce the number of registers used in order to improve occupancy.
Consider for example the following __device__ function performing a scalar product between two float3:
__device__ float dot(float3 in1, float3 in2) { return in1.x * in2.x + in1.y * in2.y + in1.z * in2.z; }
If I generate the .ptx file using

nvcc -ptx -gencode arch=compute_52,code=sm_52 -rdc=true simpleDot2.cu

(the file simpleDot2.cu contains only the definition of the __device__ function), I essentially obtain
// .globl _Z3dot6float3S_
.visible .func (.param .b32 func_retval0) _Z3dot6float3S_(
.param .align 4 .b8 _Z3dot6float3S__param_0[12],
.param .align 4 .b8 _Z3dot6float3S__param_1[12]
)
{
.reg .f32 %f<10>;
ld.param.f32 %f1, [_Z3dot6float3S__param_0+8];
ld.param.f32 %f2, [_Z3dot6float3S__param_0];
ld.param.f32 %f3, [_Z3dot6float3S__param_0+4];
ld.param.f32 %f4, [_Z3dot6float3S__param_1+8];
ld.param.f32 %f5, [_Z3dot6float3S__param_1];
ld.param.f32 %f6, [_Z3dot6float3S__param_1+4];
mul.f32 %f7, %f3, %f6;
fma.rn.f32 %f8, %f2, %f5, %f7;
fma.rn.f32 %f9, %f1, %f4, %f8;
st.param.f32 [func_retval0+0], %f9;
ret;
}
From the .ptx code, it seems that 9 registers are used, which perhaps could be lowered. I understand that the .ptx code is not the ultimate code executed by the GPU.
Question
Is there any chance to rearrange the register usage in the .ptx code, for example by recycling registers %f1-%f6, so as to reduce the overall number of occupied registers?
Thank you very much for any help.
TL;DR To first order, no.
PTX is both a virtual ISA and a compiler intermediate representation. The registers used in PTX code are virtual registers with no fixed relation to the physical registers of the GPU. The PTX code generated by the CUDA toolchain follows the SSA (static single assignment) convention, meaning that every virtual register is written to exactly once. Stated differently: whenever an instruction produces a result, that result is assigned to a new register. As a consequence, longer kernels may use thousands of PTX registers.
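To illustrate why hand-editing is futile, here is a sketch (not toolchain output) of the arithmetic core of the dot product above, first in the SSA form the compiler emits, then with virtual registers manually recycled. Both are legal PTX, but they produce the same SASS register count, because ptxas rebuilds live ranges and performs its own register allocation regardless of how the virtual registers are named:

```
// SSA form as emitted by the toolchain: each result gets a fresh register
mul.f32    %f7, %f3, %f6;
fma.rn.f32 %f8, %f2, %f5, %f7;
fma.rn.f32 %f9, %f1, %f4, %f8;

// Hand-recycled (non-SSA) equivalent: %f7 reused as an accumulator,
// apparently saving two virtual registers -- with no effect on the
// physical register allocation chosen by ptxas
mul.f32    %f7, %f3, %f6;
fma.rn.f32 %f7, %f2, %f5, %f7;
fma.rn.f32 %f7, %f1, %f4, %f7;
```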
In the CUDA toolchain, PTX code is compiled to machine code (SASS) by the ptxas component. Despite its name, ptxas is not an assembler but an optimizing compiler that can perform loop unrolling, CSE (common subexpression elimination), and so on. Most importantly, ptxas is responsible for register allocation and instruction scheduling, plus all optimizations specific to a particular GPU architecture.
As a consequence, any examination of register usage needs to focus on the machine code, which can be extracted with cuobjdump --dump-sass. Furthermore, the programmer has very limited influence on the number of registers used, because ptxas uses numerous heuristics when determining register allocation, in particular to trade off register usage against performance: scheduling loads early tends to increase register pressure by extending live ranges, as does the creation of temporary variables during CSE or the creation of induction variables for strength reduction in loops.
Modern versions of CUDA targeting compute capability 3.0 and higher usually make excellent choices in these trade-offs, and it is rarely necessary for programmers to worry about register pressure. It is not clear what motivates the asker's question in this regard.
The documented mechanisms in CUDA to control maximum register usage are the -maxrregcount command-line flag of nvcc, which applies to an entire compilation unit, and the __launch_bounds__ attribute, which allows control on a per-kernel basis. See the CUDA documentation for details. Beyond that, one can try to influence register usage by choosing the ptxas optimization level with -Xptxas -O{1|2|3} (the default is -O3), by re-arranging the source code, or by using compiler flags that tend to simplify the generated code, such as -use_fast_math.
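As a sketch of the per-kernel mechanism, __launch_bounds__ is placed on the kernel definition; the kernel below is hypothetical, but the attribute syntax is as documented:

```cuda
// Hypothetical kernel. The launch bounds tell ptxas to cap register usage
// so that blocks of up to 256 threads with at least 4 resident blocks per
// multiprocessor are possible, spilling to local memory if necessary.
__global__ void __launch_bounds__(256, 4)
scale(float3 *v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i].x *= s;
        v[i].y *= s;
        v[i].z *= s;
    }
}
```

Note that forcing a lower register count this way trades registers for local-memory spills, which may cost more performance than the improved occupancy gains.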
Of course, such indirect methods can have numerous other effects that are generally unpredictable, and any desirable result achieved will be "brittle", e.g. easily destroyed by switching to a new version of the toolchain.