Two facts: CUDA 5.0 lets you compile CUDA code into separate object files for linking later on, and CUDA architecture 2.x no longer automatically inlines all functions.
As usual in C/C++, I've implemented a function __device__ int foo() in functions.cu and placed its declaration in the header functions.hu. The function foo is called in other CUDA source files.
When I examine functions.ptx, I see that foo() spills to local memory. For testing purposes, I commented out all of the meat of foo() and just made it return 1; something still spills to local memory according to the .ptx. (I can't imagine what it is, since the function does nothing!)
However, when I move the implementation of foo() to the header file functions.hu and add the __forceinline__ qualifier, nothing is written to local memory!
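For illustration, here is roughly what the header version looks like (a sketch; I'm assuming uchar is a typedef for unsigned char, and the parameter list matches the example further down):

functions.hu:

typedef unsigned char uchar; // assumption: defined this way somewhere in the real headers

__device__ __forceinline__ int foo(const uchar param0,
                                   const uchar *const param1,
                                   const unsigned short int param2,
                                   const unsigned short int param3,
                                   const uchar param4)
{
    return 1; // real code commented out.
}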
What is going on here? Why doesn't CUDA inline such a simple function automatically?
The whole point of separate header & implementation files is to make maintaining the code easier. But if I have to stick a bunch of functions (or all of them) in the header and __forceinline__ them, that kind of defeats the purpose of CUDA 5.0's separate compilation units...
Is there any way around this?
Simple, real example:
functions.cu:
typedef unsigned char uchar; // assumption: uchar is typedef'd like this elsewhere in the project

__device__ int foo(const uchar param0,
                   const uchar *const param1,
                   const unsigned short int param2,
                   const unsigned short int param3,
                   const uchar param4)
{
    return 1; // real code commented out.
}
The above function spills to local memory.
functions.ptx:
.visible .func (.param .b32 func_retval0) _Z45fooPKhth(
    .param .b32 _Z45foohPKhth_param_0,
    .param .b64 _Z45foohPKhth_param_1,
    .param .b32 _Z45foohPKhth_param_2,
    .param .b32 _Z45foohPKhth_param_3
)
{
    .local .align 8 .b8 __local_depot72[24];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .s16 %rc<3>;
    .reg .s16 %rs<4>;
    .reg .s32 %r<2>;
    .reg .s64 %rd<2>;
Not all local memory usage represents spilling. Called functions must follow the ABI calling convention, which includes the creation of a stack frame, and the stack frame lives in local memory. When nvcc is passed the command-line switch -Xptxas -v, the compiler reports stack usage, with spilling as one subcomponent of it.
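For example, compiling with something like

nvcc -arch=sm_20 -dc -Xptxas -v functions.cu

produces a per-function report along these lines (illustrative; note that the 24-byte stack frame would match the 24-byte __local_depot in the PTX above, with zero actual spills):

ptxas info    : Function properties for <mangled name of foo>
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads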
Currently (CUDA 5.0), the CUDA toolchain does not support function inlining across the boundaries of compilation units, as some host compilers do. Thus there is a tradeoff between the flexibility of separate compilation (such as re-compiling only a small part of a large project with lengthy compile times, and the possibility of creating device-side libraries) and the performance gains that usually result from function inlining (e.g. elimination of the overhead of the ABI calling convention, and enabling of additional optimizations such as constant propagation across function boundaries).
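A typical separate-compilation build with CUDA 5.0 looks something like this (a sketch; file names are placeholders, and separate compilation requires sm_20 or higher):

nvcc -arch=sm_20 -dc functions.cu -o functions.o
nvcc -arch=sm_20 -dc main.cu -o main.o
nvcc -arch=sm_20 functions.o main.o -o app   # nvcc also performs the device link here

With whole-program compilation (everything visible in a single compilation unit, no -dc), the compiler can see the bodies of all device functions at once, so the inlining heuristics can apply across them.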
Function inlining within a single compilation unit is controlled by compiler heuristics that try to determine whether inlining is likely to be profitable in terms of performance (if it is possible at all). This means that not all functions are inlined. Programmers can override the heuristic with the function attributes __forceinline__ and __noinline__.
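A minimal sketch of the two attributes (hypothetical function names):

// Inlined at every call site within this compilation unit,
// regardless of what the heuristic would have decided.
__device__ __forceinline__ int add_inlined(int a, int b)
{
    return a + b;
}

// Kept as an ABI function call, even if the heuristic
// would have inlined it.
__device__ __noinline__ int add_called(int a, int b)
{
    return a + b;
}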