I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>
float number = 30942;
static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    // from wikipedia:
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) );
    // y = y * ( threehalfs - ( x2 * y * y ) );

    float result = y;
    benchmark::DoNotOptimize(result);
  }
}

static void BM_InverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float result = 1 / sqrt(number);
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.
Here is the relevant Assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%.) Anyway, I'm not familiar enough with Assembly to really tell what is causing such a huge performance difference.
Anyone know?
Just FYI: Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, it's obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store-forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm2,%eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi, so that isn't Windows clang.)
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compiling it similarly with -Ofast, although not identically. e.g. GCC loads number twice, once into an integer register and once into XMM0; clang loads it once and uses movd eax, xmm2 to get the integer copy.
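The fix itself is small; a sketch of the corrected benchmark (C++20 for std::bit_cast, same structure as the question's code):

#include <benchmark/benchmark.h>
#include <bit>
#include <cstdint>

float number = 30942;

static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float y = number;
    float x2 = y * 0.5F;
    std::int32_t i = std::bit_cast<std::int32_t>(y);  // 32-bit pun, no 64-bit reload
    i = 0x5f3759df - (i >> 1);
    y = std::bit_cast<float>(i);
    y = y * (1.5F - (x2 * y * y));                    // one Newton iteration
    float result = y;
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_FastInverseSqrRoot);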
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), GCC's BM_FastInverseSqrRoot is now faster than the naive version by a factor of 2, without -ffast-math.
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ resolving sqrt(float) to the float overload (equivalent to sqrtf). It does check for the number being >= 0 every time, since quick-bench doesn't allow compiling with -fno-math-errno, which would omit the check-and-branch that calls the libm function for negative inputs (to set errno). But that branch predicts perfectly, so the loop should still easily just bottleneck on port 0 throughput (the div/sqrt unit).
Quick-bench does allow -Ofast, which implies -O3 -ffast-math, and with it the naive version uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything like that. I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving "Error or timeout" whether I use that or not, with "Permission error mapping pages." and a suggestion to try a smaller -m/--mmap_pages, so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.
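For reference, a hand-rolled version of that using the SSE1 intrinsic might look like this (a sketch; rsqrtss alone gives roughly a 12-bit approximation, and one Newton-Raphson step refines it to about 23 bits):

#include <xmmintrin.h>  // SSE1: _mm_rsqrt_ss

// Approximate 1/sqrt(x): hardware rsqrt estimate plus one Newton-Raphson step.
static inline float rsqrt_newton(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // ~12-bit estimate
    return r * (1.5f - 0.5f * x * r * r);                  // refine to ~23 bits
}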