
Performance of C++/CLI function pointers versus .NET delegates

For my C++/CLI project I just tried to measure the cost of C++/CLI function pointers versus .NET delegates.

My expectation was that C++/CLI function pointers would be faster than .NET delegates. So my test separately counts the invocations of the .NET delegate and of the native function pointer over 5 seconds each.

Results

Now the results were (and still are) shocking to me:

.NET delegate: 910M executions with result 152080413333030 in 5003ms
Function pointer: 347M executions with result 57893422166551 in 5013ms

That means using the native C++/CLI function pointer is roughly 2.6x slower than using a managed delegate from within C++/CLI code. How can that be? Should I use managed constructs when it comes to interfaces, delegates or abstract classes in performance-critical sections?

The test code

The function which gets called continuously:

__int64 DoIt(int n, __int64 sum)
{
    if ((n % 3) == 0)
        return sum + n;
    else
        return sum + 1;
}

The calling code uses all the parameters as well as the return value, so that (hopefully) nothing gets optimized away. Here's the code for .NET delegates:

__int64 executions;
__int64 result;
System::Diagnostics::Stopwatch^ w = gcnew System::Diagnostics::Stopwatch();

System::Func<int, __int64, __int64>^ managedPtr = gcnew System::Func<int, __int64, __int64>(&DoIt);
w->Restart();
executions = 0;
result = 0;
while (w->ElapsedMilliseconds < 5000)
{
    for (int i=0; i < 1000000; i++)
        result += managedPtr(i, executions);
    executions++;
}
System::Console::WriteLine(".NET delegate:       {0}M executions with result {2} in {1}ms", executions, w->ElapsedMilliseconds, result);

Similar to the .NET delegate invocation, the C++ function pointer is used:

typedef __int64 (* DoItMethod)(int n, __int64 sum);

DoItMethod nativePtr = DoIt;
w->Restart();
executions = 0;
result = 0;
while (w->ElapsedMilliseconds < 5000)
{
    for (int i=0; i < 1000000; i++)
        result += nativePtr(i, executions);
    executions++;
}
System::Console::WriteLine("Function pointer:    {0}M executions with result {2} in {1}ms", executions, w->ElapsedMilliseconds, result);
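For reproducing the pattern outside a /clr project, here is a minimal sketch of the same measurement loop in standard C++ (not C++/CLI). The names Measure, Measurement, batchSize and durationMs are hypothetical and not part of the original test; std::chrono::steady_clock stands in for Stopwatch. Because nothing is managed here, it cannot show the managed/native effect itself, only the counting scheme.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Same workload as DoIt in the question, written in standard C++.
std::int64_t DoIt(int n, std::int64_t sum)
{
    if ((n % 3) == 0)
        return sum + n;
    else
        return sum + 1;
}

// Hypothetical result type: one "batch" is batchSize calls
// (the question uses 1,000,000 calls per batch, hence "M executions").
struct Measurement
{
    std::int64_t batches;   // completed batches
    std::int64_t result;    // accumulated return values, so calls are not dead code
    std::int64_t elapsedMs; // wall-clock time actually spent
};

// Runs batches of calls until durationMs has elapsed.
// Works with anything callable as call(int, std::int64_t):
// a function pointer, a std::function, a lambda, ...
template <typename Callable>
Measurement Measure(Callable&& call, int batchSize, std::int64_t durationMs)
{
    assert(batchSize > 0 && durationMs >= 0);

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    auto elapsed = [&start]() -> std::int64_t {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                   Clock::now() - start).count();
    };

    Measurement m{0, 0, 0};
    while (elapsed() < durationMs)
    {
        for (int i = 0; i < batchSize; i++)
            m.result += call(i, m.batches);
        m.batches++;
    }
    m.elapsedMs = elapsed();
    return m;
}
```

Measure(&DoIt, 1000000, 5000) mirrors the function-pointer test above; passing a std::function<std::int64_t(int, std::int64_t)> wrapping DoIt exercises the std::function path with the same loop.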

Additional information

Compiled with Visual Studio 2012
.NET Framework 4.5 was targeted
Release build (execution counts stay proportional for Debug builds)
Calling convention is __stdcall (__fastcall not allowed when the project gets compiled with CLR support)

All tests done:

.NET virtual method: 1025M executions with result 171358304166325 in 5004ms
.NET delegate: 910M executions with result 152080413333030 in 5003ms
Virtual method: 336M executions with result 56056335999888 in 5006ms
Function pointer: 347M executions with result 57893422166551 in 5013ms
Function call: 1459M executions with result 244230520832847 in 5001ms
Inlined function: 1385M executions with result 231791984166205 in 5000ms

The direct call to "DoIt" is listed here as "Function call". The compiler seems to inline it, as there is no significant difference in execution counts compared to the explicitly inlined function.

Calls to C++ virtual methods are as 'slow' as the function pointer. A virtual method of a managed class (ref class) is as fast as the .NET delegate.

Update: I dug a little deeper, and it seems that for the tests with unmanaged functions, the transition to native code happens each time the DoIt function gets called. Therefore I wrapped the inner loop in another function, which I forced to compile as unmanaged code:

#pragma managed(push, off)
__int64 TestCall(__int64* executions)
{
    __int64 result = 0;
    for (int i=0; i < 1000000; i++)
        result += DoItNative(i, *executions);
    (*executions)++;
    return result;
}
#pragma managed(pop)

Additionally, I tested std::function like this:

#pragma managed(push, off)
__int64 TestStdFunc(__int64* executions)
{
    __int64 result = 0;
    std::function<__int64(int, __int64)> func(DoItNative);
    for (int i=0; i < 1000000; i++)
        result += func(i, *executions);
    (*executions)++;
    return result;
}
#pragma managed(pop)

Now, the new results are:

Function call: 2946M executions with result 495340439997054 in 5000ms
std::function: 160M executions with result 26679519999840 in 5018ms

std::function is a bit disappointing.

Solution

You are seeing the cost of "double thunking". The core problem with your DoIt() function is that it is being compiled as managed code. The delegate call is very fast because going from managed code to managed code through a delegate is uncomplicated. The function pointer is slow, however: the compiler automatically generates code that first switches from managed code to unmanaged code to make the call through the function pointer. That call then ends up in a stub that switches from unmanaged code back to managed code and calls DoIt().

Presumably what you really meant to measure was a call to native code. Use a #pragma to force DoIt() to be generated as machine code, like this:

#pragma managed(push, off)
__int64 DoIt(int n, __int64 sum)
{
    if ((n % 3) == 0)
        return sum + n;
    else
        return sum + 1;
}
#pragma managed(pop)

You'll now see that the function pointer is faster than a delegate.
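To make the bookkeeping concrete, here is a toy model in standard C++ that only counts managed/native switches on the way into DoIt() for three setups discussed in this thread. It is a sketch of one reading of the explanation above, not a measurement: the enum Setup and the function TransitionsPerBatch are hypothetical names, return transitions are not counted, and the real cost of a switch is not modeled.

```cpp
#include <cstdint>

// Toy model only: the three call setups discussed in this thread.
enum class Setup
{
    // Managed loop calls a managed DoIt through a native function pointer:
    // managed -> native for the pointer call, then native -> managed in the stub.
    ManagedTargetViaNativePointer,
    // Managed loop calls a DoIt compiled under #pragma managed(push, off):
    // one managed -> native switch per call.
    NativeTargetViaNativePointer,
    // The loop itself is compiled native (like TestCall in the question):
    // one managed -> native switch per batch, none per call.
    NativeLoopAroundNativeTarget
};

// Number of one-way switches needed to enter DoIt for one batch of calls.
std::int64_t TransitionsPerBatch(Setup setup, std::int64_t callsPerBatch)
{
    switch (setup)
    {
    case Setup::ManagedTargetViaNativePointer:
        return 2 * callsPerBatch; // the "double thunk"
    case Setup::NativeTargetViaNativePointer:
        return callsPerBatch;
    case Setup::NativeLoopAroundNativeTarget:
        return 1;
    }
    return 0; // unreachable for valid enum values
}
```

With the 1,000,000 calls per batch used in the tests, the model gives 2,000,000 switches for the original function-pointer test, 1,000,000 once DoIt() is native, and a single switch when the whole loop is native.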