I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>
float number = 30942;
static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    // from wikipedia:
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) );
    // y = y * ( threehalfs - ( x2 * y * y ) );

    float result = y;
    benchmark::DoNotOptimize(result);
  }
}

static void BM_InverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float result = 1 / sqrt(number);
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.
Here is the relevant Assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%.) Anyway, I'm not familiar enough with Assembly to really tell what is causing such a huge performance difference.
Anyone know?
Just FYI: Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, it's obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store-forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm2,%eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi, so that isn't Windows clang.)
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compiling it similarly with -Ofast, although not identically. e.g. GCC loads number twice, once into an integer register and once into XMM0; clang loads it once and uses movd eax, xmm2 to get the integer copy.
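The fix itself is small; a sketch of the corrected benchmark (C++20 for std::bit_cast, same structure as the question's code):

#include <benchmark/benchmark.h>
#include <bit>
#include <cstdint>

float number = 30942;

static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float y = number;
    float x2 = y * 0.5F;
    std::int32_t i = std::bit_cast<std::int32_t>(y);  // 32-bit pun, no 64-bit reload
    i = 0x5f3759df - (i >> 1);
    y = std::bit_cast<float>(i);
    y = y * (1.5F - (x2 * y * y));                    // one Newton iteration
    float result = y;
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_FastInverseSqrRoot);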
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), GCC's BM_FastInverseSqrRoot is now faster than the naive version by a factor of 2, without -ffast-math.
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ resolving sqrt(float) to the float overload (equivalent to sqrtf). It does check for the number being >= 0 every time, since quick-bench doesn't allow compiling with -fno-math-errno, which would omit the check-and-branch that calls the libm function for negative inputs (to set errno). But that branch predicts perfectly, so the loop should still easily just bottleneck on port 0 throughput (the div/sqrt unit).
Quick-bench does allow -Ofast, which implies -O3 -ffast-math, and with it the naive version uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything like that. I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving "Error or timeout" whether I use that or not, with "Permission error mapping pages." and a suggestion to try a smaller -m/--mmap_pages, so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.
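For reference, a hand-rolled version of that using the SSE1 intrinsic might look like this (a sketch; rsqrtss alone gives roughly a 12-bit approximation, and one Newton-Raphson step refines it to about 23 bits):

#include <xmmintrin.h>  // SSE1: _mm_rsqrt_ss

// Approximate 1/sqrt(x): hardware rsqrt estimate plus one Newton-Raphson step.
static inline float rsqrt_newton(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // ~12-bit estimate
    return r * (1.5f - 0.5f * x * r * r);                  // refine to ~23 bits
}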