c# mono benchmarking ryujit benchmarkdotnet

Why does Mono run a simple method slower than a generic one, whereas RyuJIT runs it significantly faster?


I created a simple benchmark out of curiosity, but cannot explain the results.

As benchmark data, I prepared an array of structs with some random values. The preparation phase is not benchmarked:

struct Val 
{
    public float val;
    public float min;
    public float max;
    public float padding;
}

const int iterations = 1000;
Val[] values = new Val[iterations];
// fill the array with randoms

Basically, I wanted to compare these two clamp implementations:

static class Clamps
{
    public static float ClampSimple(float val, float min, float max)
    {
        if (val < min) return min;          
        if (val > max) return max;
        return val;
    }

    public static T ClampExt<T>(this T val, T min, T max) where T : IComparable<T>
    {
        if (val.CompareTo(min) < 0) return min;
        if (val.CompareTo(max) > 0) return max;
        return val;
    }
}
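
One side note: the two implementations are not strictly equivalent for NaN input. The < and > operators are always false when an operand is NaN, while float.CompareTo orders NaN below every other value. A small sketch of the difference (the ClampNaNDemo class is hypothetical and only repeats the two methods so that the example is self-contained):

```csharp
using System;

static class ClampNaNDemo
{
    public static float ClampSimple(float val, float min, float max)
    {
        if (val < min) return min;   // false when val is NaN
        if (val > max) return max;   // false when val is NaN
        return val;                  // so NaN passes through unchanged
    }

    public static T ClampExt<T>(T val, T min, T max) where T : IComparable<T>
    {
        if (val.CompareTo(min) < 0) return min;  // NaN.CompareTo(x) is negative for non-NaN x
        if (val.CompareTo(max) > 0) return max;
        return val;
    }

    public static void Show()
    {
        Console.WriteLine(float.IsNaN(ClampSimple(float.NaN, 0f, 1f)));     // True
        Console.WriteLine(ClampExt(float.NaN, 0f, 1f));                     // 0
        Console.WriteLine(ClampSimple(2f, 0f, 1f) == ClampExt(2f, 0f, 1f)); // True
    }
}
```

As long as the random data contains no NaN values, both methods return the same results, so the timing comparison is not affected by this.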

Here are my benchmark methods:

[Benchmark]
public float Extension()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += v.val.ClampExt(v.min, v.max);
    }

    return result;
}

[Benchmark]
public float Direct()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += Clamps.ClampSimple(v.val, v.min, v.max);
    }

    return result;
}

I'm using BenchmarkDotNet version 0.10.12 with two jobs:

[MonoJob]
[RyuJitX64Job]

And these are the results I get:

BenchmarkDotNet=v0.10.12, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7-6920HQ CPU 2.90GHz (Skylake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2836123 Hz, Resolution=352.5940 ns, Timer=TSC
  [Host]    : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0
  Mono      : Mono 5.12.0 (Visual Studio), 64bit
  RyuJitX64 : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0


    Method |       Job | Runtime |      Mean |     Error |    StdDev |
---------- |---------- |-------- |----------:|----------:|----------:|
 Extension |      Mono |    Mono | 10.860 us | 0.0063 us | 0.0053 us |
    Direct |      Mono |    Mono | 11.211 us | 0.0074 us | 0.0062 us |
 Extension | RyuJitX64 |     Clr |  5.711 us | 0.0014 us | 0.0012 us |
    Direct | RyuJitX64 |     Clr |  1.395 us | 0.0056 us | 0.0052 us |

I can accept that Mono is somewhat slower here in general. But what I don't understand is:

Why does Mono run the Direct method slower than Extension, given that Direct uses a very simple comparison method whereas Extension goes through additional method calls?

With RyuJIT, by contrast, the simple method is about 4x faster (1.395 us vs. 5.711 us).

Can anyone explain this?


Solution

  • Since nobody has looked at the disassembly so far, I'll answer my own question.

    The reason seems to be the native code that the JITs generate, not the array bounds checking or caching issues mentioned in the comments.

    RyuJIT generates very efficient code for the ClampSimple method:

        vucomiss xmm1,xmm0
        jbe     M01_L00
        vmovaps xmm0,xmm1
        ret
    
    M01_L00:
        vucomiss xmm0,xmm2
        jbe     M01_L01
        vmovaps xmm0,xmm2
        ret
    
    M01_L01:
        ret
    

    It compares the floats directly with the CPU's native ucomiss instruction (here in its VEX-encoded form, vucomiss) and moves them between registers with fast movaps (vmovaps) instructions, so nothing in this method touches memory.

    The extension method is slower because it makes up to two calls to System.Single.CompareTo(System.Single). Here is the first branch:

    lea     rcx,[rsp+30h]
    vmovss  dword ptr [rsp+38h],xmm1
    call    mscorlib_ni+0xda98f0
    test    eax,eax
    jge     M01_L00
    vmovss  xmm0,dword ptr [rsp+38h]
    add     rsp,28h
    ret
    

    Let's have a look at the native code Mono produces for the ClampSimple method:

        cvtss2sd    xmm0,xmm0  
        movss       xmm1,dword ptr [rsp+8]  
        cvtss2sd    xmm1,xmm1  
        comisd      xmm1,xmm0  
        jbe         M01_L00  
        movss       xmm0,dword ptr [rsp+8]  
        cvtss2sd    xmm0,xmm0  
        cvtsd2ss    xmm0,xmm0  
        jmp         M01_L01 
    
    M01_L00: 
        movss       xmm0,dword ptr [rsp]  
        cvtss2sd    xmm0,xmm0  
        movss       xmm1,dword ptr [rsp+10h]  
        cvtss2sd    xmm1,xmm1  
        comisd      xmm1,xmm0  
        jp          M01_L02
        jae         M01_L02  
        movss       xmm0,dword ptr [rsp+10h]  
        cvtss2sd    xmm0,xmm0  
        cvtsd2ss    xmm0,xmm0  
        jmp         M01_L01
    
    M01_L02:
        movss       xmm0,dword ptr [rsp]  
        cvtss2sd    xmm0,xmm0  
        cvtsd2ss    xmm0,xmm0  
    
    M01_L01:
        add         rsp,18h  
        ret 
    

    Mono's code converts the floats to doubles and compares them using comisd. Furthermore, there are strange float -> double -> float "conversion flips" when preparing the return value, and there is much more traffic between memory and registers. This explains why Mono's code for the simple method is slower than RyuJIT's.

    Mono's code for the Extension method is very similar to RyuJIT's, but again with the strange float -> double -> float conversion flips:

    movss       xmm0,dword ptr [rbp-10h]  
    cvtss2sd    xmm0,xmm0  
    movsd       xmm1,xmm0  
    cvtsd2ss    xmm1,xmm1  
    lea         rbp,[rbp]  
    mov         r11,2061520h  
    call        r11  
    test        eax,eax  
    jge         M0_L0 
    movss       xmm0,dword ptr [rbp-10h]  
    cvtss2sd    xmm0,xmm0  
    cvtsd2ss    xmm0,xmm0
    ret
    

    It seems that RyuJIT can generate more efficient code for handling floats. Mono treats floats as doubles and converts the values each time, which also causes additional value transfers between CPU registers and memory.

    Note that all of this applies to Windows x64 only. I don't know how this benchmark performs on Linux or macOS.
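
    The float -> double -> float conversions in Mono's code cost time, but they should not change any results, since every float value is exactly representable as a double. A minimal sketch that checks this bit for bit (the RoundTrip class and its method names are made up for illustration; NaN is left out on purpose):

    ```csharp
    using System;

    static class RoundTrip
    {
        // Raw 32-bit pattern of a float; this overload also exists on .NET Framework.
        static int Bits(float f) => BitConverter.ToInt32(BitConverter.GetBytes(f), 0);

        // Widen to double and narrow back, like cvtss2sd followed by cvtsd2ss.
        public static bool IsLossless(float f)
        {
            double widened = f;
            float narrowed = (float)widened;
            return Bits(narrowed) == Bits(f);
        }

        public static void Show()
        {
            float[] samples = { 0f, -0f, 1f, 0.1f, 3.1415927f, float.MaxValue, float.Epsilon };
            foreach (float s in samples)
                Console.WriteLine($"{s:R}: {IsLossless(s)}");
        }
    }
    ```

    So the extra conversions are pure overhead here: they add instructions and memory traffic without affecting the value that is returned.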