
CUDA stack frame size increase by __forceinline__


When I declare device functions with __forceinline__, the linker outputs this information:

2>  nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2>  nvlink : info : used 28 registers, 456 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem

and without it the output is:

2>  nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2>  nvlink : info : used 23 registers, 216 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem

Why is the stack frame smaller when __forceinline__ is not used? How important is it to keep the stack frame as small as possible? Thank you for your help.


Solution

  • The main reason to reduce the stack frame is that the stack is allocated in local memory, which resides in off-chip device memory. This makes stack accesses (when not cached) slow.

    To show this, let me make a simple example. Consider the case:

    __device__ __noinline__ void func(float* d_a, float* test, int tid) {
        d_a[tid]=test[tid]*d_a[tid];
    }
    
    __global__ void kernel_function(float* d_a) {
        float test[16];
        test[threadIdx.x] = threadIdx.x;
        func(d_a,test,threadIdx.x);
    }
    

    Note that the __device__ function is declared __noinline__. In this case

    ptxas : info : Function properties for _Z15kernel_functionPf
        64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas : info : Used 7 registers, 36 bytes cmem[0]
    

    i.e., we have 64 bytes of stack frame. The corresponding disassembled code is

    MOV R1, c[0x1][0x100];
    ISUB R1, R1, 0x40;
    S2R R6, SR_TID.X;                    R6 = ThreadIdx.x
    MOV R4, c[0x0][0x20];
    IADD R5, R1, c[0x0][0x4];
    I2F.F32.U32 R2, R6;                  R2 = R6 (integer to float conversion)              
    ISCADD R0, R6, R1, 0x2;
    STL [R0], R2;                        stores R2 to test[ThreadIdx.x]                                
    CAL 0x50; 
    EXIT ;                               __device__ function part
    ISCADD R2, R6, R5, 0x2;
    ISCADD R3, R6, R4, 0x2;
    LD R2, [R2];                         loads test[tid]
    LD R0, [R3];                         loads d_a[tid]
    FMUL R0, R2, R0;                     d_a[tid] = d_a[tid]*test[tid]
    ST [R3], R0;                         store the new value of d_a[tid] to global memory
    RET ;
    

    As you can see, test is stored to and loaded from local memory (in off-chip device memory), forming the stack frame (16 floats = 64 bytes).

    Now change the device function to

    __device__ __forceinline__ void func(float* d_a, float* test, int tid) {
        d_a[tid]=test[tid]*d_a[tid];
    }
    

    that is, change the __device__ function from __noinline__ to __forceinline__. In this case, we have

    ptxas : info : Compiling entry function '_Z15kernel_functionPf' for 'sm_20'
    ptxas : info : Function properties for _Z15kernel_functionPf
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    

    i.e., we have an empty stack frame now. Indeed, the disassembled code becomes:

    MOV R1, c[0x1][0x100];               
    S2R R2, SR_TID.X;                    R2 = ThreadIdx.x
    ISCADD R3, R2, c[0x0][0x20], 0x2;    
    I2F.F32.U32 R2, R2;                  R2 = R2 (integer to float conversion)
    LD R0, [R3];                         R0 = d_a[ThreadIdx.x] (load from global memory)
    FMUL R0, R2, R0;                     d_a[ThreadIdx.x] = d_a[ThreadIdx.x] * ThreadIdx.x
    ST [R3], R0;                         stores the new value of d_a[ThreadIdx.x] to global memory
    EXIT ;
    

    As you can see, forcing the inlining enables the compiler to perform proper optimizations, so that test is now discarded from the code entirely.

    In the above example, __forceinline__ has the opposite effect to the one you are experiencing, which also shows that, without further information, your first question cannot be answered.
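    For completeness, the two cases above can be reproduced from one self-contained file. This is a minimal sketch: the `main` routine, the launch configuration, and the file name are my own additions (not part of the original question), included only so the file compiles and links on its own. Toggling the qualifier macro switches between the two `ptxas` reports discussed above.

    ```cuda
    // stack_demo.cu -- hypothetical file name; compile with, e.g.:
    //   nvcc -arch=sm_20 -Xptxas -v stack_demo.cu    (ptxas resource/stack report)
    //   cuobjdump -sass a.out                        (SASS disassembly)
    // Switch INLINE_QUALIFIER between __noinline__ and __forceinline__ to
    // compare the 64-byte vs 0-byte stack frames discussed above.
    #define INLINE_QUALIFIER __noinline__   // or __forceinline__

    __device__ INLINE_QUALIFIER void func(float* d_a, float* test, int tid) {
        d_a[tid] = test[tid] * d_a[tid];
    }

    __global__ void kernel_function(float* d_a) {
        float test[16];                 // per-thread array: placed on the stack
                                        // (local memory) unless optimized away
        test[threadIdx.x] = threadIdx.x;
        func(d_a, test, threadIdx.x);
    }

    int main() {
        float* d_a;
        cudaMalloc(&d_a, 16 * sizeof(float));
        kernel_function<<<1, 16>>>(d_a);
        cudaDeviceSynchronize();
        cudaFree(d_a);
        return 0;
    }
    ```

    The `-Xptxas -v` flag forwards `--verbose` to `ptxas`, which prints the per-function stack frame, spill, and register counts shown in the outputs above.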