
OpenCL: How efficient are user-defined functions in kernel code in terms of memory and performance?


In OpenCL C kernel code, the built-in functions are good, but what about user-defined functions? Do they cost performance or memory compared with the built-in ones? If so, is it better to write the code of such a function directly inside the __kernel void function, once or multiple times?

For example:

gentype clamp ( gentype x,
gentype minval,
gentype maxval)

The above is a built-in function; it has no impact on performance, nor does it reduce GPU L0/L1 cache memory.

By user-defined function I mean something like this:

int Add(int a, int b)
{
   return a + b;
}

Do these functions have any impact on L0/L1 memory? If so, is it better not to write them as functions and instead repeat the code everywhere?


Solution

  • I usually inline all functions, unless they are very long and are called many times within a kernel. For example,

    float __attribute__((always_inline)) sq(const float x) {
        return x*x;
    }
    

    for computing the square of x. An inlined function adds no computational cost for the function call itself. However, if you inline a very long function many times in a kernel, the assembly blows up and spills into global memory, resulting in a loss of performance. In that case, the overhead of the function call is negligible compared to the execution time of the function itself, so there is little to gain from inlining. Finally, if you don't explicitly inline a very short function, the compiler will do it automatically in most cases. The same holds for loop unrolling with #pragma unroll.
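As a rough illustration of the inlining advice above, here is a minimal sketch in plain C rather than OpenCL C, so that it can be compiled on the host with GCC or Clang. The names square_item and run_square are hypothetical: the first stands in for a kernel body, the second for the kernel launch. The attribute syntax is the same as in the answer's sq example; whether a real OpenCL compiler inlines the helper without the attribute depends on the vendor.

```c
#include <stddef.h>

/* Small helper, force-inlined with the GCC/Clang attribute syntax. */
static inline __attribute__((always_inline)) float sq(const float x) {
    return x * x;
}

/* Hypothetical stand-in for a kernel body: handles one work-item `gid`. */
static void square_item(const float *in, float *out, size_t gid) {
    out[gid] = sq(in[gid]); /* after inlining this is just in[gid] * in[gid] */
}

/* Hypothetical stand-in for the kernel launch: loops over all work-items
 * and returns the sum of the squares so the result is easy to check. */
static float run_square(const float *in, float *out, size_t n) {
    float sum = 0.0f;
    for (size_t gid = 0; gid < n; gid++) {
        square_item(in, out, gid);
        sum += out[gid];
    }
    return sum;
}
```

Calling run_square on the inputs 1, 2, 3, 4 fills out with 1, 4, 9, 16. The helper keeps the code readable while the generated code is the same as writing x*x in place.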

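    Regarding the math functions, most of them map directly to hardware instructions, with a few exceptions. For example, the count-leading-zeros function int y = clz(x);, despite being translated into the clz PTX instruction, has no dedicated hardware and is slower than emulating it with int y = 158-(int)(as_uint((float)x)>>23);. Here 158 = 31+127 accounts for the float exponent bias of 127, and the emulation assumes x > 0 and small enough that the conversion to float does not round up to the next power of two (always the case for x < 2^24). Similarly, although the inverse square root rsqrt(x) is executed in hardware,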

    float __attribute__((always_inline)) fast_rsqrt(const float x) {
        return as_float(0x5F37642F-(as_int(x)>>1));
    }
    

    runs slightly faster but is less accurate. In most cases the built-in math functions are the best option though.
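To make the two bit tricks above concrete, here is a minimal host-side sketch in plain C that checks their results, not their speed. The helpers as_uint_host and as_float_host are hypothetical stand-ins for OpenCL's as_uint and as_float, and clz_reference is a simple loop used only for comparison. Note that the float exponent field is biased by 127, so the sketch removes the bias by subtracting from 158 = 31 + 127.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side stand-ins for OpenCL's as_uint / as_float:
 * reinterpret the bits without converting the value. */
static inline uint32_t as_uint_host(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
static inline float as_float_host(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Count leading zeros from the float exponent. The biased exponent of
 * (float)x is 127 + floor(log2(x)), so clz = 31 - floor(log2(x))
 * = 158 - biased exponent. Assumes 0 < x < 2^24, where the conversion
 * to float is exact; larger x can round up and give a result one too small. */
static inline int clz_via_float(uint32_t x) {
    return 158 - (int)(as_uint_host((float)x) >> 23);
}

/* Straightforward reference: scan from the top bit down. Returns 32 for 0. */
static inline int clz_reference(uint32_t x) {
    int n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        n++;
    }
    return n;
}

/* Approximate 1/sqrt(x) for positive, normal x using the magic constant
 * from the answer. No Newton iteration follows, so expect a relative
 * error of a few percent. */
static inline float fast_rsqrt_host(float x) {
    return as_float_host(0x5F37642Fu - (as_uint_host(x) >> 1));
}
```

For instance, clz_via_float(255) and clz_reference(255) both give 24, while fast_rsqrt_host(4.0f) gives roughly 0.483 instead of the exact 0.5. That gap is the accuracy traded for speed, and it is one reason the built-in functions usually remain the safer choice.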