Tags: opencl, nvidia, nvcc, ptx

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?


I'm doing a small OpenCL benchmark using the Nvidia drivers. My kernel performs 1024 fused multiply-adds and stores the result in an array:

#define FLOPS_MACRO_1(x)    { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x)    { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x)    { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x)    { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }

__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
   for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))  
   {
      float f = (float) i;
      FLOPS_MACRO_1024(f)
      pf[i] = f;
   }
}
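
For reference, FLOPS_MACRO_4(f) expands (braces and redundant parentheses omitted) to four dependent multiply-adds on the same variable:

f = f * 0.99f + 10.f;
f = f * 0.99f + 10.f;
f = f * 0.99f + 10.f;
f = f * 0.99f + 10.f;

so FLOPS_MACRO_1024(f) produces a straight-line chain of 1024 such statements, each depending on the previous one.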

But when I look at the generated PTX, I see this:

.entry ocl_Kernel_FLOPS(
    .param .u32 ocl_Kernel_FLOPS_param_0,
    .param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
    .reg .f32   %f<1026>; // 1026 float registers!
    .reg .pred  %p<3>;
    .reg .s32   %r<19>;    

    ld.param.u32    %r1, [ocl_Kernel_FLOPS_param_0];
    // some more code unrelated to the problem
    // ...

BB1_1:
    and.b32     %r13, %r18, 65535;
    cvt.rn.f32.u32  %f1, %r13;
    fma.rn.f32  %f2, %f1, 0f3F7D70A4, 0f41200000;
    fma.rn.f32  %f3, %f2, 0f3F7D70A4, 0f41200000;
    fma.rn.f32  %f4, %f3, 0f3F7D70A4, 0f41200000;
    fma.rn.f32  %f5, %f4, 0f3F7D70A4, 0f41200000;
    // etc
    // ...

If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuses a register, even though it could perform all the multiply-adds with only 2 registers. 1026 is far above the maximum number of registers a thread is allowed to have (according to the specs), so I guess this ends up in memory spilling.
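
For comparison, here is a hand-written sketch (not actual compiler output; the constants are copied from the dump above) of a chain that keeps overwriting a single virtual register instead:

    cvt.rn.f32.u32  %f1, %r13;
    fma.rn.f32      %f1, %f1, 0f3F7D70A4, 0f41200000;
    fma.rn.f32      %f1, %f1, 0f3F7D70A4, 0f41200000;
    // ... 1022 more fma.rn.f32 instructions, each overwriting %f1 ...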

Is it a compiler bug, or am I totally missing something?

I am using nvcc version 6.5 on a Quadro K2000 GPU.

EDIT

Actually I did miss something in the specs:

"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate a large number of register names. Rather than require explicit declaration of every name, PTX supports a syntax for creating a set of variables having a common prefix string appended with integer suffixes. For example, suppose a program uses a large number, say one hundred, of .b32 variables, named %r0, %r1, ..., %r99"


Solution

  • The PTX file format is intended to describe a virtual machine and instruction set architecture:

    PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.

    So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.

    The PTX representation is then compiled into actual binaries for the respective target GPU. This indirection is important because it makes it possible to abstract from the actual architecture. Specifically, regarding your example: it should be possible to use the same PTX representation of a program regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are virtual registers, and in the end they may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to nvcc during compilation to obtain additional information about the actual register usage (see the example invocation after this answer).

    (This is roughly the same idea as the one behind LLVM - namely, to have a representation that can be optimized and reasoned about, abstracting both from the original source code and from the actual target architecture.)
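
    A minimal sketch of that invocation, assuming the kernel were built as a hypothetical CUDA source file kernel.cu (-arch=sm_30 matches the Quadro K2000's compute capability 3.0):

        nvcc --ptxas-options=-v -arch=sm_30 -c kernel.cu -o kernel.o

    ptxas then prints per-kernel statistics in its "ptxas info" output, including the number of hardware registers used per thread and any local-memory spill loads/stores.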