Understanding why an ASM fsqrt implementation is faster than the standard sqrt function

I have playing around with basic math function implementations in C++ for academic purposes. Today, I benchmarked the following code for Square Root:

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}

I was surprised to see that it is consistently faster than the standard sqrt function (it takes around 85% of the execution time of the standard function).

I don't quite get why and would love to better understand it. Below I show the full code I am using to profile (in Visual Studio 2015, compiling in Release mode and with all optimizations turned on):

#include <iostream>
#include <random>
#include <chrono>
#define M 1000000

float ranfloats[M];

using namespace std;

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}

int main()
{
    default_random_engine randomGenerator(time(0));
    uniform_real_distribution<float> diceroll(0.0f , 1.0f);

chrono::high_resolution_clock::time_point start1, start2;
chrono::high_resolution_clock::time_point end1, end2;
float sqrt1 = 0;
float sqrt2 = 0;


for (int i = 0; i<M; i++) ranfloats[i] = diceroll(randomGenerator);


start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);
end1 = std::chrono::high_resolution_clock::now();

start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);
end2 = std::chrono::high_resolution_clock::now();

auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
auto time2  = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();


cout << "Time elapsed for SQRT1: " << time1 << " seconds" << endl;
cout << "Time elapsed for SQRT2: " << time2 << " seconds" << endl;

cout << "Average of Time for SQRT2 / Time for SQRT1: " << time2 / time1  << endl;
cout << "Equal to standard sqrt? " << (sqrt1 == sqrt2) << endl;



system("pause");
return 0;
}

EDIT: I am editing the question to include disassembly codes of both loops that calculate square roots as they came at Visual Studio 2015.

First, the disassembly for for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);:

00091194 0F 5A C0             cvtps2pd    xmm0,xmm0  
00091197 E8 F2 18 00 00       call        __libm_sse2_sqrt_precise (092A8Eh)  
0009119C F2 0F 5A C0          cvtsd2ss    xmm0,xmm0  
000911A0 83 C6 04             add         esi,4  
000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0  
000911AF 81 FE 90 5C 46 00    cmp         esi,offset __dyn_tls_dtor_callback (0465C90h)  
000911B5 7C D9                jl          main+190h (091190h)

Next, the disassembly for for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);:

00091290 F3 0F 10 00          movss       xmm0,dword ptr [eax]  
00091294 F3 0F 11 44 24 6C    movss       dword ptr [esp+6Ch],xmm0  
0009129A D9 44 24 6C          fld         dword ptr [esp+6Ch]  
0009129E D9 FA                fsqrt  
000912A0 D9 5C 24 6C          fstp        dword ptr [esp+6Ch]  
000912A4 F3 0F 10 44 24 6C    movss       xmm0,dword ptr [esp+6Ch]  
000912AA 83 C0 04             add         eax,4  
000912AD F3 0F 58 44 24 54    addss       xmm0,dword ptr [esp+54h]  
000912B3 F3 0F 11 44 24 54    movss       dword ptr [esp+54h],xmm0  
000912B9 ??                   ?? ?? 
000912BA ??                   ?? ?? 
000912BB ??                   ?? ?? 
000912BC ??                   ?? ?? 
000912BD ??                   ?? ?? 
000912BE ??                   ?? ?? 
000912BF ??                   ?? ?? 
000912C0 ??                   ?? ?? 
000912C1 ??                   ?? ?? 
000912C2 ??                   ?? ?? 
000912C3 ??                   ?? ?? 
000912C4 ??                   ?? ?? 
000912C5 ??                   ?? ?? 
000912C6 ??                   ?? ?? 
000912C7 ??                   ?? ?? 
000912C8 ??                   ?? ?? 
000912C9 ??                   ?? ?? 
000912CA ??                   ?? ?? 
000912CB ??                   ?? ?? 
000912CC ??                   ?? ?? 
000912CD ??                   ?? ?? 
000912CE ??                   ?? ?? 
000912CF ??                   ?? ?? 
000912D0 ??                   ?? ?? 
000912D1 ??                   ?? ?? 
000912D2 ??                   ?? ?? 
000912D3 ??                   ?? ?? 
000912D4 ??                   ?? ?? 
000912D5 ??                   ?? ?? 
000912D6 ??                   ?? ?? 
000912D7 ??                   ?? ?? 
000912D8 ??                   ?? ?? 
000912D9 ??                   ?? ?? 
000912DA ??                   ?? ?? 
000912DB ??                   ?? ?? 
000912DC ??                   ?? ?? 
000912DD ??                   ?? ?? 
000912DE ??                   ?? ??

Solution

Both your loops come out pretty horrible, with many bottlenecks other than the sqrt function call or the FSQRT instruction. And at least 2x slower than optimal scalar SQRTSS (single-precision) code could do. And that's maybe 8x slower than what a decent SSE2 vectorized loop could achieve. Even without reordering any math operations, you could beat SQRTSS throughput.

Many of the reasons from https://gcc.gnu.org/wiki/DontUseInlineAsm apply to your example. The compiler won't be able to propagate constants through your function, and it won't know that the result is alway non-negative (if it isn't NaN). It also won't be able to optimize it into an fabs() if you later square the number.

Also highly important, you defeat auto-vectorization with SSE2 SQRTPS (_mm_sqrt_ps()). A "no-error-checking" scalar sqrt() function using intrinsics also suffers from that problem. IDK if there's any way to get optimal results without /fp:fast, but I doubt it. (Other than writing a whole loop in assembly, or vectorizing the whole loop yourself with intrinsics).

It's pretty impressive that your Haswell CPU manages to run the function-call loop as fast as it does, although the inline-asm loop may not even be saturating FSQRT throughput either.

For some reason, your library function call is calling double sqrt(double), not the C++ overload float sqrt(float). This leads to a conversion to double and back to float. Probably you need to #include <cmath> to get the overloads, or you could call sqrtf(). gcc and clang on Linux call sqrtf() with your current code (without converting to double and back), but maybe their <random> header happens to include <cmath>, and MSVC's doesn't. Or maybe there's something else going on.

The library function-call loop keeps the sum in memory (instead of a register). Apparently the calling convention used by the 32-bit version of __libm_sse2_sqrt_precise doesn't preserve any XMM registers. The Windows x64 ABI does preserve XMM6-XMM15, but wikipedia says this is new and the 32-bit ABI didn't do that. I assume if there were any call-preserved XMM registers, MSVC's optimizer would take advantage of them.

Anyway, besides the throughput bottleneck of calling sqrt on each independent scalar float, the loop-carried dependency on sqrt1 is a latency bottleneck that includes a store-forwarding round trip:

000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0

Out of order execution lets rest of the code for each iteration overlap, so you just bottleneck on throughput, but no matter how efficient the library sqrt function is, this latency bottleneck limits the loop to one iteration per 6 + 3 = 9 cycles. (Haswell ADDSS latency = 3, store-forwarding latency for XMM load/store = 6 cycles. 1 cycle more than store-forwarding for integer registers. See Agner Fog's instruction tables.)

SQRTSD has a throughput of one per 8-14 cycles, so the loop-carried dependency is not the limiting bottleneck on Haswell.

The inline-asm version with has a store/reload round trip for the sqrt result, but it's not part of the loop-carried dependency chain. MSVC inline-asm syntax makes it hard to avoid store-forwarding round trips to get data into / out of inline asm. But worse, you produce the result on the x87 stack, and the compiler wants to do SSE math in XMM registers.

And then MSVC shoots itself in the foot for no reason, keeping the sum in memory instead of in an XMM register. It looks inside inline-asm statements to see which registers they affect, so IDK why it doesn't see that your inline-asm statement doesn't clobber any XMM regs.

So MSVC does a much worse job than necessary here:

00091290     movss       xmm0,dword ptr [eax]       # load from the array
00091294     movss       dword ptr [esp+6Ch],xmm0   # store to the stack
0009129A     fld         dword ptr [esp+6Ch]        # x87 load from stack
0009129E     fsqrt  
000912A0     fstp        dword ptr [esp+6Ch]        # x87 store to the stack
000912A4     movss       xmm0,dword ptr [esp+6Ch]   # SSE load from the stack (of sqrt(array[i]))
000912AA     add         eax,4  
000912AD     addss       xmm0,dword ptr [esp+54h]   # SSE load+add of the sum
000912B3     movss       dword ptr [esp+54h],xmm0   # SSE store of the sum

So it has the same loop-carried dependency chain (ADDSS + store-forwarding) as the function-call loop. Haswell FSQRT has one per 8-17 cycle throughput, so probably it's still the bottleneck. (All the stores/reloads involving the array value are independent for each iteration, and out-of-order execution can overlap many iterations to hide that latency chain. However, they will clog up the load/store execution units and sometimes delay the critical-path loads/stores by an extra cycle. This is called a resource conflict.)

Without /fp:fast, the sqrtf() library function has to set errno if the result is NaN. This is why it can't inline to just a SQRTSS.

If you did want to implement a no-checks scalar sqrt function yourself, you'd do it with Intel intrinsics syntax:

// DON'T USE THIS, it defeats auto-vectorization
static inline
float sqrt_scalar(float x) {
    __m128 xvec = _mm_set_ss(x);
    xvec =  _mm_cvtss_f32(_mm_sqrt_ss(xvec));
}

This compiles to a near-optimal scalar loop with gcc and clang (without -ffast-math). See it on the Godbolt compiler explorer:

# gcc6.2 -O3  for the sqrt_new loop using _mm_sqrt_ss.  good scalar code, but don't optimize further.
.L10:
    movss   xmm0, DWORD PTR [r12]
    add     r12, 4
    sqrtss  xmm0, xmm0
    addss   xmm1, xmm0
    cmp     r12, rbx
    jne     .L10

This loop should bottleneck only on SQRTSS throughput (one per 7 clocks on Haswell, notably faster than SQRTSD or FSQRT), and with no resource conflicts. However, it's still garbage compared to what you could do even without re-ordering the FP adds (since FP add/mul aren't truly associative): a smart compiler (or programmer using intrinsics) would use SQRTPS to get 4 results with the same throughput as 1 result from SQRTSS. Unpack the vector of SQRT results to 4 scalars, and then you can keep exactly the same order of operations with identical rounding of intermediate results. I'm disappointed that clang and gcc didn't do this.

However, gcc and clang do manage to actually avoid calling a library function. clang3.9 (with just -O3) uses SQRTSS without even checking for NaN. I assume that's legal, and not a compiler bug. Maybe it sees that the code doesn't use errno?

gcc6.2 on the other hand speculatively inlines sqrtf(), with a SQRTSS and a check on the input to see if it needs to call the library function.

# gcc6.2's sqrt() loop, without -ffast-math.
# speculative inlining of SQRTSS with a check + fallback
# spills/reloads a lot of stuff to memory even when it skips the call :(

# xmm1 = 0.0  (gcc -fverbose-asm says it's holding sqrt2, which is zero-initialized, so I guess gcc decides to reuse that zero)
.L9:
    movss   xmm0, DWORD PTR [rbx]
    sqrtss  xmm5, xmm0
    ucomiss xmm1, xmm0                  # compare input against 0.0
    movss   DWORD PTR [rsp+8], xmm5
    jbe     .L8                         # if(0.0 <= SQRTSS input || unordered(0.0, input)) { skip the function call; }
    movss   DWORD PTR [rsp+12], xmm1    # silly gcc, this store isn't needed.  ucomiss doesn't modify xmm1
    call    sqrtf                       # called for negative inputs, but not for NaN.
    movss   xmm1, DWORD PTR [rsp+12]
.L8:
    movss   xmm4, DWORD PTR [rsp+4]     # silly gcc always stores/reloads both, instead of putting the stores/reloads inside the block that the jbe skips
    addss   xmm4, DWORD PTR [rsp+8]
    add     rbx, 4
    movss   DWORD PTR [rsp+4], xmm4
    cmp     rbp, rbx
    jne     .L9

gcc unfortunately shoots itself in the foot here, the same way MSVC does with inline-asm: There's a store-forwarding round trip as a loop-carried dependency. All the spill/reloads could be inside the block skipped by the JBE. Maybe gcc things negative inputs will be common.

Even worse, if you do use /fp:fast or -ffast-math, even a clever compiler like clang doesn't manage to rewrite your _mm_sqrt_ss into a SQRTPS. Clang is normally pretty good at not just mapping intrinsics to instructions 1:1, and will come up with more optimal shuffles and blends if you miss an opportunity to combine things.

So with fast FP math enabled, using _mm_sqrt_ss is a big loss. clang compiles the sqrt() library function call version into RSQRTPS + a newton-raphson iteration.

Also note that your microbenchmark code isn't sensitive to the latency of your sqrt_new() implementation, only the throughput. Latency often matters in real FP code, not just throughput. But in other cases, like doing the same thing independently to many array elements, latency doesn't matter, because out-of-order execution can hide it well enough by having instructions in flight from many loop iterations.

As I mentioned earlier, latency from theextra store/reload round trip your data takes on its way in/out of MSVC-style inline-asm is a serious problem here. When MSVC inlines the function, the fld n doesn't come directly from the array.

BTW, Skylake has SQRTPS/SS throughput of one per 3 cycles, but still 12 cycle latency. SQRTPD/SD throughput = one per 4-6 cycles, latency = 15-16 cycles. So FP square root is more pipelined on Skylake than on Haswell. This magnifies the difference between benchmarking FP sqrt latency vs. throughput.