I believe that Windows has been using that instruction internally for a long time now, so it's something CPU manufacturers would have spent effort to optimise?
Of course, this assumes suitably aligned memory, no sharing of the cache line, etc.
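To make "suitably aligned" concrete: cmpxchg16b requires its memory operand to be 16-byte aligned, and cache-line sharing is usually avoided by aligning each hot object to the line size. A minimal sketch, assuming a 64-byte cache line (typical on x86-64, but an assumption here); the names `kCacheLine` and `Slot` are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical 16-byte payload that a cmpxchg16b-based CAS would operate on.
// alignas(kCacheLine) satisfies the 16-byte alignment cmpxchg16b needs and
// pads each Slot out to a full cache line, so two Slots never share a line.
struct alignas(kCacheLine) Slot {
    std::uint64_t lo;
    std::uint64_t hi;
};

static_assert(alignof(Slot) % 16 == 0, "cmpxchg16b needs 16-byte alignment");
static_assert(sizeof(Slot) == kCacheLine, "one Slot per cache line");
```

The padding costs 48 bytes per object, which is the usual trade-off for keeping contended data on separate lines.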
Out of curiosity, I wrote a small benchmark to compare the cost of 4- and 8-byte cmpxchg with cmpxchg16b:
#include <cstdint>
#include <benchmark/benchmark.h>

// 16 KiB buffer, zero-initialised and 16-byte aligned as cmpxchg16b requires.
alignas(16) char input[16 * 1024] = {};

template<class T>
void do_benchmark(benchmark::State& state) {
    unsigned n = 0;
    T* p = reinterpret_cast<T*>(input);
    constexpr unsigned count = sizeof input / sizeof(T);
    unsigned i = 0;
    for(auto _ : state) {
        T v{0};
        // Compare with 0 and store 0: the CAS always succeeds on this buffer.
        n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
    }
    benchmark::DoNotOptimize(n);
}

BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
BENCHMARK_TEMPLATE(do_benchmark, __int128);
BENCHMARK_MAIN();
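Note that every iteration compares against 0 and stores 0 into a zero-filled buffer, so each CAS succeeds and leaves memory unchanged; the benchmark therefore measures the successful, uncontended path. A minimal sketch of that behaviour for GCC/Clang, using a 64-bit operand so that it builds without `-mcx16`; the helper names `cas_zero_to_zero` and `count_successes` are hypothetical and not part of the benchmark:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring one benchmark iteration on a 64-bit operand:
// compare *p with 0 and, if equal, store 0. Returns true if the swap happened.
inline bool cas_zero_to_zero(std::int64_t* p) {
    std::int64_t v{0};
    return __sync_bool_compare_and_swap(p, v, v);
}

// Counts successful CAS operations over a buffer, like `n` in the benchmark.
inline unsigned count_successes(std::int64_t* buf, unsigned count) {
    assert(buf != nullptr);
    unsigned n = 0;
    for (unsigned i = 0; i < count; ++i)
        n += cas_zero_to_zero(buf + i);
    return n;
}
```

For a zero-filled buffer `count_successes` returns `count`; any non-zero element makes its CAS fail and is left untouched.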
I ran it on a Coffee Lake i9-9900KS CPU.
Results with gcc-8.3.0:
$ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/gcc/cmpxchg16b_benchmark
2020-03-15 20:18:48
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.43, 0.40, 0.34
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
do_benchmark<__int128>           6.35 ns         6.35 ns    110215116
Results with clang-8.0.0:
$ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/clang/cmpxchg16b_benchmark
2020-03-15 20:19:00
Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.36, 0.39, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
do_benchmark<__int128>           5.99 ns         5.99 ns    116972653
It looks like cmpxchg16b is around 1.6-1.8x more expensive than 8-byte cmpxchg on Intel Coffee Lake.
The same benchmark on a Ryzen 9 5950X with gcc-9.3.0:
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (32 X 4889.51 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 1.11, 0.52, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
do_benchmark<__int128>           2.22 ns         2.22 ns    316143309
cmpxchg16b is around 1.4x more expensive than 8-byte cmpxchg on AMD Ryzen 9.
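The quoted ratios follow directly from the timings in the tables above. A small sketch that recomputes them at compile time (the function name `ratio` and the constant names are only for illustration):

```cpp
// Ratio of the 16-byte CAS time to the 8-byte CAS time, both in nanoseconds.
constexpr double ratio(double ns16, double ns8) { return ns16 / ns8; }

// Timings copied from the benchmark output above.
constexpr double coffee_lake_gcc   = ratio(6.35, 3.53);  // about 1.80
constexpr double coffee_lake_clang = ratio(5.99, 3.84);  // about 1.56
constexpr double ryzen_gcc         = ratio(2.22, 1.58);  // about 1.41

static_assert(coffee_lake_gcc   > 1.79 && coffee_lake_gcc   < 1.81, "Intel, gcc");
static_assert(coffee_lake_clang > 1.55 && coffee_lake_clang < 1.57, "Intel, clang");
static_assert(ryzen_gcc         > 1.40 && ryzen_gcc         < 1.41, "AMD, gcc");
```

These are per-operation averages from a single-threaded, uncontended run, so they say nothing about behaviour under contention.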