Will a function passed through a typedef'd function pointer be properly inlined in each instance of a function template?

I have written a function that operates on several streams of data at the same time and writes its result to a destination stream. A huge amount of time has gone into optimizing this function's performance (OpenMP, intrinsics, etc.), and it performs beautifully. There is a lot of math involved, so needless to say it is a very long function.

Now I want to reuse the same function with different math code substituted in, without writing out each version of the function by hand. I want to differentiate between the versions using only #defines or inline functions (the code has to be inlined in each version).

I went for templates, but templates allow only type specifiers, and I realized that #defines can't be used here. The remaining solution would be inline math functions, so the simplified idea is to create a header like this:

'alm_quasimodo.h':

#pragma once

typedef struct ALM_DATA
{
  int l, t, r, b;
  int scan;
  BYTE* data;  
} ALM_DATA;

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);
// etc

inline BYTE math_a1(BYTE& A, BYTE& B){ return ((BYTE)((B > A) ? B:A)); }
inline BYTE math_a2(BYTE& A, BYTE& B){ return ((BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8))); }
inline BYTE math_a3(BYTE& A, BYTE& B){ return ((BYTE)((B < 128)?(2*(((long)A>>1)+64))*((float)B/255):(255-(2*(255-(((long)A>>1)+64))*(float)(255-B)/255)))); }
// etc

template <typename MATH>
inline int const template_math_av (MATH math, ALM_DATA& a, ALM_DATA& b) 
{ 
  // ultra simplified version of very complex code
  for (int y = a.t; y <= a.b; y++)
  {
    int yoffset = y * a.scan;
    for (int x = a.l; x <= a.r; x++)
    {
      int xoffset = yoffset + x;
      a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
    }
  }
  return 0;
}

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);

and math_caller is defined in 'alm_quasimodo.cpp' as follows:

#include "stdafx.h"
#include "alm_quasimodo.h"

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
  switch(condition)
  {
    case 1: return template_math_av<MATH_FX>(math_a1, a, b);
      break;
    case 2: return template_math_av<MATH_FX>(math_a2, a, b);
      break;
    case 3: return template_math_av<MATH_FX>(math_a3, a, b);
      break;
    // etc
  }
  return -1;
}

My main concern here is optimization, mainly inlining of the MATH function code, without breaking the existing optimizations of the original code. And without writing a separate instance of the function for each math operation, of course ;)

So does this template properly inline all the math functions? And are there any suggestions on how to optimize this function template?

If nothing else, thanks for reading this lengthy question.

Solution

It all depends on your compiler, the optimization level, and how and where the math_a1 to math_a3 functions are defined. Usually, the compiler can optimize this if the functions in question are inline functions in the same compilation unit as the rest of the code. If this doesn't happen for you, you may want to consider functors instead of functions.
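As a rough sketch of the functor route applied to the question's code: every call in math_caller passes the same type MATH_FX, so there is only one instantiation, template_math_av<MATH_FX>, and inlining the math depends on the compiler propagating a constant function pointer. With functors, each operation is its own type and therefore gets its own instantiation. The names math_a1_fn and math_a2_fn are made up for illustration, BYTE is assumed to be an 8-bit unsigned type, and ALM_API is left out to keep the example self-contained. Whether the calls are actually inlined still has to be checked in the generated assembly.

```cpp
#include <cstdint>

typedef std::uint8_t BYTE;  // assumption: BYTE is an 8-bit unsigned type

struct ALM_DATA {
    int l, t, r, b;
    int scan;
    BYTE* data;
};

// Hypothetical functor versions of math_a1 and math_a2 from the question.
struct math_a1_fn {
    BYTE operator()(BYTE a, BYTE b) const { return b > a ? b : a; }
};
struct math_a2_fn {
    BYTE operator()(BYTE a, BYTE b) const {
        return static_cast<BYTE>(255 - (((255 - a) * (255 - b)) >> 8));
    }
};

// Math is deduced as a distinct type per functor, so each operation
// produces a separate instantiation of this template.
template <typename Math>
int template_math_av(Math math, ALM_DATA& a, ALM_DATA& b) {
    for (int y = a.t; y <= a.b; ++y) {
        int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; ++x) {
            int xoffset = yoffset + x;
            a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

int math_caller(int condition, ALM_DATA& a, ALM_DATA& b) {
    switch (condition) {
        case 1: return template_math_av(math_a1_fn(), a, b);
        case 2: return template_math_av(math_a2_fn(), a, b);
    }
    return -1;  // unknown operation
}
```

The functors take their arguments by value rather than by BYTE&; for a one-byte type this should make no difference once the call is inlined.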

Here are some simple examples I experimented with. You can do the same for your function, and check the behavior of different compilers.

For my example, GCC 7.3 and clang 6.0 are pretty good at optimizing out function calls (provided they see the definition of the function, of course). However, somewhat surprisingly, ICC 18.0.0 is only able to optimize out functors and closures. Even inline functions give it some trouble.

Just to have some code here in case the link stops working in the future. For the following code:

template <typename T, int size, typename Closure>
T accumulate(T (&array)[size], T init, Closure closure) {
    for (int i = 0; i < size; ++i) {
        init = closure(init, array[i]);
    }
    return init;
}

int sum(int x, int y) { return x + y; }
inline int sub_inline(int x, int y) { return x - y; }
struct mul_functor {
    int operator ()(int x, int y) const  { return x * y; }
};
extern int extern_operation(int x, int y);

int accumulate_function(int (&array)[5]) {
    return accumulate(array, 0, sum);
}
int accumulate_inline(int (&array)[5]) {
    return accumulate(array, 0, sub_inline);
}
int accumulate_functor(int (&array)[5]) {
    return accumulate(array, 1, mul_functor());
}
int accumulate_closure(int (&array)[5]) {
    return accumulate(array, 0, [](int x, int y) { return x | y; });
}
int accumulate_exetern(int (&array)[5]) {
    return accumulate(array, 0, extern_operation);
}

GCC 7.3 (x86-64) produces the following assembly:

sum(int, int):
        lea     eax, [rdi+rsi]
        ret
accumulate_function(int (&) [5]):
        mov     eax, DWORD PTR [rdi+4]
        add     eax, DWORD PTR [rdi]
        add     eax, DWORD PTR [rdi+8]
        add     eax, DWORD PTR [rdi+12]
        add     eax, DWORD PTR [rdi+16]
        ret
accumulate_inline(int (&) [5]):
        mov     eax, DWORD PTR [rdi]
        neg     eax
        sub     eax, DWORD PTR [rdi+4]
        sub     eax, DWORD PTR [rdi+8]
        sub     eax, DWORD PTR [rdi+12]
        sub     eax, DWORD PTR [rdi+16]
        ret
accumulate_functor(int (&) [5]):
        mov     eax, DWORD PTR [rdi]
        imul    eax, DWORD PTR [rdi+4]
        imul    eax, DWORD PTR [rdi+8]
        imul    eax, DWORD PTR [rdi+12]
        imul    eax, DWORD PTR [rdi+16]
        ret
accumulate_closure(int (&) [5]):
        mov     eax, DWORD PTR [rdi+4]
        or      eax, DWORD PTR [rdi+8]
        or      eax, DWORD PTR [rdi+12]
        or      eax, DWORD PTR [rdi]
        or      eax, DWORD PTR [rdi+16]
        ret
accumulate_exetern(int (&) [5]):
        push    rbp
        push    rbx
        lea     rbp, [rdi+20]
        mov     rbx, rdi
        xor     eax, eax
        sub     rsp, 8
.L8:
        mov     esi, DWORD PTR [rbx]
        mov     edi, eax
        add     rbx, 4
        call    extern_operation(int, int)
        cmp     rbx, rbp
        jne     .L8
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
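For the code in the question, one more option that keeps the plain inline functions is to pass the function itself as a non-type template parameter, since templates accept function pointers as well as types. The sketch below assumes BYTE is an 8-bit unsigned type and uses the made-up names template_math_nt and math_caller_nt. Each instantiation names its function at compile time, which tends to make inlining easy for the compiler, but as with the examples above the result should be verified in the assembly for your compiler and flags.

```cpp
#include <cstdint>

typedef std::uint8_t BYTE;  // assumption: BYTE is an 8-bit unsigned type

struct ALM_DATA {
    int l, t, r, b;
    int scan;
    BYTE* data;
};

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);

inline BYTE math_a1(BYTE& A, BYTE& B) { return (BYTE)((B > A) ? B : A); }
inline BYTE math_a2(BYTE& A, BYTE& B) {
    return (BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8));
}

// The math function is a compile-time template argument, not a runtime
// pointer, so every operation gets its own instantiation.
template <MATH_FX math>
int template_math_nt(ALM_DATA& a, ALM_DATA& b) {
    for (int y = a.t; y <= a.b; ++y) {
        int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; ++x) {
            int xoffset = yoffset + x;
            a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

int math_caller_nt(int condition, ALM_DATA& a, ALM_DATA& b) {
    switch (condition) {
        case 1: return template_math_nt<math_a1>(a, b);
        case 2: return template_math_nt<math_a2>(a, b);
    }
    return -1;  // unknown operation
}
```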