performance x86-64 intel atomic amd-processor

On an average modern x64 CPU, is cmpxchg16b much slower than its 64-bit or 32-bit counterparts?

I believe Windows has been using that instruction internally for a long time now, so presumably it's something CPU manufacturers have put effort into optimising.

This assumes, of course, suitably aligned memory, no sharing of the cache line, and so on.

Solution

Out of curiosity, I wrote a small benchmark to compare the cost of 4- and 8-byte cmpxchg with cmpxchg16b:

#include <cstdint>
#include <benchmark/benchmark.h>

alignas(16) char input[16 * 1024] = {};

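// Benchmarks a compare-and-swap on elements of type T, cycling through `input`.
template<class T>
void do_benchmark(benchmark::State& state) {
    unsigned n = 0; // Counts successful swaps; kept alive by DoNotOptimize below.
    T* p = reinterpret_cast<T*>(input);
    constexpr unsigned count = sizeof input / sizeof(T);
    unsigned i = 0;
    for(auto _ : state) {
        T v{0};
        // The buffer is all zeros, so comparing with 0 and storing 0 always succeeds.
        n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
    }
    benchmark::DoNotOptimize(n);
}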

BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
BENCHMARK_TEMPLATE(do_benchmark, __int128);
BENCHMARK_MAIN();
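
The alignas(16) on the buffer matters: cmpxchg16b requires a 16-byte-aligned memory operand and raises a general-protection fault otherwise. As a minimal sketch of the "suitably aligned, no shared cache line" assumption from the question, the layout could be expressed as below. The names Pair16, PaddedSlot and is_aligned are hypothetical and not part of the benchmark, and the 64-byte cache line size is an assumption that holds for the CPUs measured here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte payload: two 64-bit halves updated by one cmpxchg16b.
struct alignas(16) Pair16 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// One payload per cache line (64 bytes assumed), so that two slots
// written by different threads never share a line.
struct alignas(64) PaddedSlot {
    Pair16 value;
};

static_assert(sizeof(Pair16) == 16, "payload must be exactly 16 bytes");
static_assert(alignof(Pair16) == 16, "cmpxchg16b needs 16-byte alignment");
static_assert(sizeof(PaddedSlot) == 64, "slot should fill one cache line");

// Returns true if `address` is a multiple of `alignment`.
constexpr bool is_aligned(std::uintptr_t address, std::size_t alignment) {
    return address % alignment == 0;
}
```

The benchmark itself touches every element from a single thread, so padding is not needed there. It only becomes relevant once several threads update neighbouring objects.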

I ran it on a Coffee Lake i9-9900KS CPU.

Results with gcc-8.3.0:

$ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/gcc/cmpxchg16b_benchmark
2020-03-15 20:18:48
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.43, 0.40, 0.34
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
do_benchmark<__int128>           6.35 ns         6.35 ns    110215116

Results with clang-8.0.0:

$ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/clang/cmpxchg16b_benchmark
2020-03-15 20:19:00
Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.36, 0.39, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
do_benchmark<__int128>           5.99 ns         5.99 ns    116972653

On Intel Coffee Lake, cmpxchg16b takes around 1.6-1.8x as long as 8-byte cmpxchg (6.35 ns vs 3.53 ns with gcc, 5.99 ns vs 3.84 ns with clang), while the 4-byte and 8-byte variants cost the same.

The same benchmark on a Ryzen 9 5950X with gcc-9.3.0:

Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (32 X 4889.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.11, 0.52, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
do_benchmark<__int128>           2.22 ns         2.22 ns    316143309

On the AMD Ryzen 9 5950X, cmpxchg16b takes around 1.4x as long as 8-byte cmpxchg (2.22 ns vs 1.58 ns).
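
For anyone wanting to re-derive the ratios from the tables above, here is a small sketch of the arithmetic. The helpers ratio, cycles and print_summary are hypothetical names introduced for illustration. The cycle figures additionally assume the core was really running at the frequency printed in the "Run on" line, which Google Benchmark reports from the OS and which is not guaranteed to be the clock during the run.

```cpp
#include <cstdio>

// How many times longer one per-iteration time is than another.
constexpr double ratio(double slow_ns, double fast_ns) {
    return slow_ns / fast_ns;
}

// Approximate core cycles for a time in ns at a clock given in MHz.
// Assumption: the core ran at the reported frequency for the whole run.
constexpr double cycles(double ns, double mhz) {
    return ns * mhz / 1000.0;
}

// Prints the 16-byte vs 8-byte ratio and rough cycle counts for each run.
void print_summary() {
    std::printf("i9-9900KS gcc:   %.2fx (%.1f vs %.1f cycles)\n",
                ratio(6.35, 3.53), cycles(6.35, 5100), cycles(3.53, 5100));
    std::printf("i9-9900KS clang: %.2fx (%.1f vs %.1f cycles)\n",
                ratio(5.99, 3.84), cycles(5.99, 5100), cycles(3.84, 5100));
    std::printf("5950X gcc:       %.2fx (%.1f vs %.1f cycles)\n",
                ratio(2.22, 1.58), cycles(2.22, 4889.51), cycles(1.58, 4889.51));
}
```

Under that frequency assumption, the 8-byte case works out to roughly 18 cycles per iteration on the i9-9900KS with gcc and roughly 8 cycles on the 5950X. These are per-iteration costs of the whole loop body, not isolated instruction latencies.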