
Optimizing register usage in dot product


I'm developing a kernel function with several vector operations, such as scalar and vector products. The kernel uses a large number of registers, so occupancy is very low. I'm trying to reduce the number of registers used in order to improve occupancy.

Consider for example the following __device__ function performing a scalar product between two float3:

__device__ float dot(float3 in1, float3 in2) { return in1.x * in2.x + in1.y * in2.y + in1.z * in2.z; }

If I generate the .ptx file using

nvcc -ptx -gencode arch=compute_52,code=sm_52 -rdc=true simpleDot2.cu

(the file simpleDot2.cu contains only the definition of the __device__ function), I essentially obtain

    // .globl   _Z3dot6float3S_
.visible .func  (.param .b32 func_retval0) _Z3dot6float3S_(
    .param .align 4 .b8 _Z3dot6float3S__param_0[12],
    .param .align 4 .b8 _Z3dot6float3S__param_1[12]
)
{
    .reg .f32   %f<10>;


    ld.param.f32    %f1, [_Z3dot6float3S__param_0+8];
    ld.param.f32    %f2, [_Z3dot6float3S__param_0];
    ld.param.f32    %f3, [_Z3dot6float3S__param_0+4];
    ld.param.f32    %f4, [_Z3dot6float3S__param_1+8];
    ld.param.f32    %f5, [_Z3dot6float3S__param_1];
    ld.param.f32    %f6, [_Z3dot6float3S__param_1+4];
    mul.f32     %f7, %f3, %f6;
    fma.rn.f32  %f8, %f2, %f5, %f7;
    fma.rn.f32  %f9, %f1, %f4, %f8;
    st.param.f32    [func_retval0+0], %f9;
    ret;
}

From the .ptx code, it seems that 9 registers are used (%f1 through %f9), which could perhaps be lowered. I understand that the .ptx code is not the ultimate code executed by the GPU.

Question

Is there any way to rearrange the register usage in the .ptx code, for example by recycling registers %f1-%f6, so as to reduce the overall number of registers occupied?

Thank you very much for any help.


Solution

  • TL;DR: To first order, no.

    PTX is both a virtual ISA and a compiler intermediate representation. The registers used in PTX code are virtual registers with no fixed relation to the physical registers of the GPU. The PTX code generated by the CUDA toolchain follows the SSA (static single assignment) convention, meaning that every virtual register is written exactly once. Stated differently: whenever an instruction produces a result, it is assigned to a new register. This is visible in the listing above, where %f1 through %f9 each appear as a destination exactly once, and it means that longer kernels may use thousands of virtual registers.
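    To illustrate, one could hand-edit the PTX above to recycle a virtual register, as in the hypothetical fragment below. This is legal PTX (SSA form is merely a convention of the compiler's output), but it is pointless, because the back-end compiler (ptxas, described below) discards the virtual numbering when it performs its own register allocation.

        mul.f32     %f7, %f3, %f6;          // reuse %f7 for every intermediate result
        fma.rn.f32  %f7, %f2, %f5, %f7;
        fma.rn.f32  %f7, %f1, %f4, %f7;
        st.param.f32    [func_retval0+0], %f7;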

    In the CUDA toolchain, PTX code is compiled to machine code (SASS) by the ptxas component. Despite its name, ptxas is not an assembler but an optimizing compiler that can perform loop unrolling, CSE (common subexpression elimination), and so on. Most importantly, ptxas is responsible for register allocation and instruction scheduling, as well as all optimizations specific to a particular GPU architecture.

    As a consequence, any examination of register usage issues needs to focus on the machine code, which can be extracted with cuobjdump --dump-sass. Furthermore, the programmer has very limited influence on the number of registers used, because ptxas applies numerous heuristics during register allocation, in particular to trade off register usage against performance: scheduling loads early tends to increase register pressure by extending live ranges, as does the creation of temporary variables during CSE or of induction variables for strength reduction in loops.
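    For example, one way to inspect the SASS and the register usage is sketched below, following the compile line from the question; the output file name is arbitrary.

        nvcc -cubin -rdc=true -gencode arch=compute_52,code=sm_52 simpleDot2.cu -o simpleDot2.cubin
        cuobjdump --dump-sass simpleDot2.cubin

    Passing -Xptxas -v to nvcc also makes ptxas print the number of registers used per function at compile time.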

    Modern versions of CUDA targeting compute capability 3.0 and higher usually make excellent choices in these trade-offs, and it is rarely necessary for programmers to worry about register pressure. It is not clear what motivates the asker's question in this regard.

    The documented mechanisms in CUDA for controlling maximum register usage are the -maxrregcount command-line flag of nvcc, which applies to an entire compilation unit, and the __launch_bounds__ attribute, which allows control on a per-kernel basis. See the CUDA documentation for details. Beyond that, one can try to influence register usage by choosing the ptxas optimization level with -Xptxas -O{1|2|3} (the default is -O3), by re-arranging the source code, or by using compiler flags that tend to simplify the generated code, such as -use_fast_math.
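    As a sketch of the per-kernel mechanism, a kernel using the asker's dot function might be annotated as follows; the kernel name and the bound values (256 threads per block, at least 4 resident blocks per multiprocessor) are made up for this example, and useful values depend entirely on the actual kernel and target architecture.

        // hypothetical example: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
        // caps the registers ptxas may use so that the requested number of
        // blocks can be resident per SM; assumes dot() is defined in the same file
        __global__ void __launch_bounds__(256, 4)
        computeDots(const float3 *in1, const float3 *in2, float *out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                out[i] = dot(in1[i], in2[i]);
        }

    The coarser alternative, nvcc -maxrregcount 32 (again, an arbitrary value), would impose the same register cap on every kernel in the compilation unit.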

    Of course, such indirect methods can have numerous other effects that are generally unpredictable, and any desirable result achieved will be "brittle", e.g. easily destroyed by moving to a new version of the toolchain.