Search code examples
c++performanceassemblycompiler-optimizationinlining

Why does compiler inlining produce slower code than manual inlining?


Background

The following critical loop of a piece of numerical software, written in C++, basically compares two objects by one of their members:

for(int j=n;--j>0;)
    asd[j%16]=a.e<b.e;

a and b are of class ASD:

struct ASD  {
    float e;
    ...
};

I was investigating the effect of putting this comparison in a lightweight member function:

bool test(const ASD& y)const {
    return e<y.e;
}

and using it like this:

for(int j=n;--j>0;)
    asd[j%16]=a.test(b);

The compiler is inlining this function, but the problem is, that the assembly code will be different and cause >10% of runtime overhead. I have to question:

Questions

  1. Why is the compiler prodrucing different assembly code?

  2. Why is the produced assembly slower?

EDIT: The second question has been answered by implementing @KamyarSouri's suggestion (j%16). The assembly code now looks almost identical (see http://pastebin.com/diff.php?i=yqXedtPm). The only differences are the lines 18, 33, 48:

000646F9  movzx       edx,dl 

Material

This chart shows the FLOP/s (up to a scaling factor) for 50 testruns of my code.

enter image description here

The gnuplot script to generate the plot: http://pastebin.com/8amNqya7

Compiler Options:

/Zi /W3 /WX- /MP /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /MT /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Gd /analyze-

Linker Options: /INCREMENTAL:NO "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE


Solution

  • Short Answer:

    Your asd array is declared as this:

    int *asd=new int[16];
    

    Therefore, use int as the return type rather than bool.
    Alternatively, change the array type to bool.

    In any case, make the return type of the test function match the type of the array.

    Skip to bottom for more details.

    Long Answer:

    In the manually inlined version, the "core" of one iteration looks like this:

    xor         eax,eax  
     
    mov         edx,ecx  
    and         edx,0Fh  
    mov         dword ptr [ebp+edx*4],eax  
    mov         eax,dword ptr [esp+1Ch]  
    movss       xmm0,dword ptr [eax]  
    movss       xmm1,dword ptr [edi]  
    cvtps2pd    xmm0,xmm0  
    cvtps2pd    xmm1,xmm1  
    comisd      xmm1,xmm0  
    

    The compiler inlined version is completely identical except for the first instruction.

    Where instead of:

    xor         eax,eax
    

    it has:

    xor         eax,eax  
    movzx       edx,al
    

    Okay, so it's one extra instruction. They both do the same - zeroing a register. This is the only difference that I see...

    The movzx instruction has a single-cycle latency and 0.33 cycle reciprocal throughput on all the newer architectures. So I can't imagine how this could make a 10% difference.

    In both cases, the result of the zeroing is used only 3 instructions later. So it's very possible that this could be on the critical path of execution.


    While I'm not an Intel engineer, here's my guess:

    Most modern processors deal with zeroing operations (such as xor eax,eax) via register renaming to a bank of zero registers. It completely bypasses the execution units. However, it's possible that this special handling could cause a pipeline bubble when the (partial) register is accessed via movzx edi,al.

    Furthermore, there's also a false dependency on eax in the compiler inlined version:

    movzx       edx,al  
    mov         eax,ecx  //  False dependency on "eax".
    

    Whether or not the out-of-order execution is able to resolve this is beyond me.


    Okay, this is basically turning into a question of reverse-engineering the MSVC compiler...

    Here I'll to explain why that extra movzx is generated as well as why it stays.

    The key here is the bool return value. Apparently, bool datatypes are probably as stored 8-bit values inside the MSVC internal-representation. Therefore when you implicitly convert from bool to int here:

    asd[j%16] = a.test(b);
    ^^^^^^^^^   ^^^^^^^^^
     type int   type bool
    

    there is an 8-bit -> 32-bit integer promotion. This is the reason why MSVC generates the movzx instruction.

    When the inlining is done manually, the compiler has enough information to optimize out this conversion and keeps everything as a 32-bit datatype IR.

    However, when the code is put into it's own function with a bool return value, the compiler is not able to optimize out the 8-bit intermediate datatype. Therefore, the movzx stays.

    When you make both datatypes the same (either int or bool), no conversion is needed. Hence the problem is avoided altogether.