Search code examples
cudagpunsightnsight-compute

CUDA math function register usage


I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division and how the register usage might be reduced/eliminated.

I'm using the following program:

#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
    __global__ void kernel(float* out) {
        uint32_t n = threadIdx.x + blockIdx.x*blockDim.x;
        out[n] = atan2f(static_cast<float>(n), 2.0f);
    }
}

int main(int argc, char const* argv[]) {
    float* d_ary;
    cudaMalloc(&d_ary, 32);
    kernel<<<1,32>>>(d_ary);
    float ary[32];
    cudaMemcpy(ary, d_ary, 32, cudaMemcpyDeviceToHost);
}

and building it with:

nvcc -arch=sm_80 -Xptxas="-v" kernel.cu

Profiling the kernel produces results in the image attached below.

The massive spike in register usage occurs when atan2() is called (or some function within atan2), increasing the register count by more than 100. As far as I can tell this seems to be due to the fact that atan2() is not inlined. Is there any way to get these more expensive floating point operations to get inlined other than resorting to compiler flags like use_fast_math?

enter image description here

EDIT:

@njuffa pointed out that the function call causing the register spike is associated with a slow path taken within atan2 which calls into an internal CUDA function that is not inlined. After some testing the register spike seems to be associated with any non-inlined function call (CALL.ABS.NOINC). Any device function decorated with __noinline__ results in the same phenomenon. Further, nested __noinline__ calls result in the live register count reported by Nsight increasing even further, up to the cap of 255.


Solution

  • I posted about this on the Nsight Computer forums and was informed that it is a bug and will be fixed in a future release.

    https://forums.developer.nvidia.com/t/contraditory-register-count-report-when-calling-a-non-inlined-function/259908