c function performance pointers overhead

C - efficiently changing a function pointer based on command line input

I have several similar functions, say A, B, C. I want to choose one of them with command line options. Also, I'm calling that function billion times because of that instead of checking a variable inside a function billion times, I'm defining a function pointer Phi and set it to desired function just one time. But when I set, Phi = A, (so no user input considered) my code runs in ~24 secs, when I add an if-else and set Phi to desired function, my code runs in ~30 secs with exact same parameters. (Of course command line option sets Phi to A) What is the efficient way to handle this case?

My functions:

double funcA(double r)
{
    return 0;
}

double funcB(double r)
{
    return 1;
}

double funcC(double r)
{
    return r;
}

void computationFunctionFast(Context *userInputs) {
    double (*Phi)(double) = funcA;
    
    /* computation codes */
}

void computationFunctionSlow(Context *userInputs) {
    double (*Phi)(double);
    switch (userInputs->funcEnum) {
        case A:
        Phi = funcA;
        break;

        case B:
        Phi = funcB;
        break;

        case C:
        Phi = funcC;
    }
    
    /* computation codes */
}

I've tried gcc, clang, icx with -O2 and -O3 optimizations. (gcc has no performance difference in mentioned cases but has the worst performance) Although I'm using C, I've tried std::function too. I've tried defining Phi function in different scopes etc.

Solution

Generally, there are a few things here that are slightly bad for performance:

Branches/comparisons lead to inefficient use of branch prediction/instruction cache and might affect pipelining too.
Function pointers are notoriously inefficient since they generally block inlining and generally the compiler can't do much about them.

Here's an example based on your code:

double computationFunctionSlow (int input, double val) {
    double (*Phi)(double);
    switch (input) {
        case 0:  Phi = funcA; break;
        case 1:  Phi = funcB; break;
        case 2:  Phi = funcC; break;
    }
    double res = Phi(val);
    return res;
}

clang 15.0.0 x86_64 -O3 gives:

computationFunctionSlow:                # @computationFunctionSlow
        cmp     edi, 2
        ja      .LBB3_1
        movsxd  rax, edi
        lea     rcx, [rip + .Lswitch.table.computationFunctionSlow]
        jmp     qword ptr [rcx + 8*rax]         # TAILCALL
.LBB3_1:
        xorps   xmm0, xmm0
        ret
.Lswitch.table.computationFunctionSlow:
        .quad   funcA
        .quad   funcB
        .quad   funcC

Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp. Even when I include a default: return 0; it is still there. You can quite easily manually optimize any switch with contiguous indices like this into a function pointer jump table:

double computationFunctionSlow (int input, double val) {
    double (*Phi[3])(double) = {funcA, funcB, funcC};
    double res = Phi[input](val);
    return res;
}

clang 15.0.0 x86_64 -O3 gives:

computationFunctionSlow:                # @computationFunctionSlow
        movsxd  rax, edi
        lea     rcx, [rip + .L__const.computationFunctionSlow.Phi]
        jmp     qword ptr [rcx + 8*rax]         # TAILCALL
.L__const.computationFunctionSlow.Phi:
        .quad   funcA
        .quad   funcB
        .quad   funcC

This leads to slightly better code here as the comparison instruction/branch is now removed. However, this is really a micro optimization that shouldn't have that much impact of performance. You have to benchmark it for sure to see if there's any improvement.

(Also gcc 12.2 didn't optimize this code as good, why I went with clang for this example.)

Godbolt link: https://godbolt.org/z/ja4zerj7o