Performance difference between Windows and Linux using Intel compiler: looking at the assembly

I am running a program on both Windows and Linux (x86-64). It was compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, yet the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which resolves to the Intel math library in both cases (by default it is linked dynamically on Windows and statically on Linux, but dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

#include <cmath>
#include <cstdio>

int main() {
  int n = 100000000;
  float sum = 1.0f;

  for (int k = 0; k < n; k++) {
    sum += std::erf(sum);
  }

  std::printf("%7.2f\n", sum);
}

When I profile this program with VTune, I find that the assembly differs slightly between the Windows and Linux versions. Here is the call site (the loop) on Windows:

Block 3:
vmovaps xmm0, xmm6
call 0x1400023e0 <erff>
Block 4:
inc ebx
vaddss xmm6, xmm6, xmm0
cmp ebx, 0x5f5e100
jl 0x14000103f <Block 3>

And here is the beginning of the erf function called on Windows:

Block 1:
push rbp
sub rsp, 0x40
lea rbp, ptr [rsp+0x20]
lea rcx, ptr [rip-0xa6c81]
movd edx, xmm0
movups xmmword ptr [rbp+0x10], xmm6
movss dword ptr [rbp+0x30], xmm0
mov eax, edx
and edx, 0x7fffffff
and eax, 0x80000000
add eax, 0x3f800000
mov dword ptr [rbp], eax
movss xmm6, dword ptr [rbp]
cmp edx, 0x7f800000
...

On Linux, the code is a bit different. The call site is:

Block 3
vmovaps %xmm1, %xmm0
vmovssl  %xmm1, (%rsp)
callq  0x400bc0 <erff>
Block 4
inc %r12d
vmovssl  (%rsp), %xmm1
vaddss %xmm0, %xmm1, %xmm1   <-------- hotspot here
cmp $0x5f5e100, %r12d
jl 0x400b6b <Block 3>

and the beginning of the called function (erf) is:

movd %xmm0, %edx
movssl  %xmm0, -0x10(%rsp)   <-------- hotspot here
mov %edx, %eax
and $0x7fffffff, %edx
and $0x80000000, %eax
add $0x3f800000, %eax
movl  %eax, -0x18(%rsp)
movssl  -0x18(%rsp), %xmm0
cmp $0x7f800000, %edx
jnl 0x400dac <Block 8>
...

I have marked the two points where the time is lost on Linux.

Does anyone understand assembly well enough to explain the difference between the two versions and why the Linux one is 3 times slower?

Solution

In both cases the arguments and results are passed only in registers, as per the respective calling conventions on Windows and GNU/Linux.

In the GNU/Linux variant, xmm1 is used for accumulating the sum. Since it is a call-clobbered (a.k.a. caller-saved) register, the caller stores it in its own stack frame before each call and reloads it afterwards.

In the Windows variant, xmm6 is used for accumulating the sum. This register is callee-saved in the Windows calling convention (but not in the GNU/Linux one), so the caller can keep the sum in it across the call.
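
To make the register rule concrete, here is a small sketch that encodes it. The names Abi and is_callee_saved_xmm are made up for illustration and are not part of any library; the rule itself is that the Windows x64 convention treats xmm6 through xmm15 as callee-saved, while the System V x86-64 convention used on GNU/Linux treats every xmm register as call-clobbered.

```cpp
// Hypothetical names, for illustration only.
enum class Abi { WindowsX64, SystemV };

// True if the callee must preserve xmm<reg> under the given ABI.
// Windows x64: xmm6..xmm15 are callee-saved (nonvolatile).
// System V x86-64: no xmm register is callee-saved.
constexpr bool is_callee_saved_xmm(Abi abi, int reg) {
  return abi == Abi::WindowsX64 && reg >= 6 && reg <= 15;
}

// The two accumulator registers seen in the listings:
static_assert(is_callee_saved_xmm(Abi::WindowsX64, 6),
              "xmm6 survives a call on Windows");
static_assert(!is_callee_saved_xmm(Abi::SystemV, 1),
              "xmm1 may be clobbered by a call on GNU/Linux");
static_assert(!is_callee_saved_xmm(Abi::SystemV, 6),
              "xmm6 is not preserved on GNU/Linux either");
```

This is why the Windows build can keep the sum in xmm6 for the whole loop, whereas the GNU/Linux build has no xmm register that is guaranteed to survive the call.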

So, in summary, the GNU/Linux version saves/restores both xmm0 (in the callee[1]) and xmm1 (in the caller), whereas the Windows version saves/restores only xmm6 (in the callee).
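
If you want to look at the caller side in isolation, a reduced version of the loop can help. The following is only a sketch: accumulate, call_erf and half are hypothetical helper names, and whether the call really stays opaque depends on the compiler and optimization level.

```cpp
#include <cmath>

// Hypothetical helper: same loop shape as in the question. The callee is a
// parameter, so the loop is written against an opaque call (an optimizer may
// still inline it if it can see the target).
float accumulate(float (*f)(float), int n) {
  float sum = 1.0f;  // live across every call to f
  for (int k = 0; k < n; k++) {
    // sum has to survive the call: either it is spilled to the stack
    // around the call, or it is kept in a callee-saved register.
    sum += f(sum);
  }
  return sum;
}

// Hypothetical callees: the real one, and a trivial one for sanity checks.
float call_erf(float x) { return std::erf(x); }
float half(float x) { return 0.5f * x; }
```

Compiling this to assembly on each platform (for example with -S on GNU/Linux) and checking which register holds sum across the call should show the same pattern as the listings above: a stack store and reload around the call on GNU/Linux, and a register in the xmm6 to xmm15 range on Windows.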

[1] One would need to look at the implementation of erff to figure out why.