Tags: performance, x86-64, intel, atomic, amd-processor

On the average modern x64 CPU, is cmpxchg16b much slower than its 64- or 32-bit counterparts?


I believe Windows has been using that instruction internally for a long time now, so is it something CPU manufacturers have spent effort optimising?

This assumes, of course, suitably aligned memory, no sharing of the cache line, and so on.


Solution

  • Out of curiosity, I wrote a small benchmark to compare the cost of 4- and 8-byte cmpxchg with cmpxchg16b:

    #include <cstdint>
    #include <benchmark/benchmark.h>
    
    // Scratch buffer, 16-byte aligned as cmpxchg16b requires.
    alignas(16) char input[16 * 1024] = {};
    
    template<class T>
    void do_benchmark(benchmark::State& state) {
        unsigned n = 0; // Counts successful swaps so the result stays live.
        T* p = reinterpret_cast<T*>(input);
        constexpr unsigned count = sizeof input / sizeof(T);
        unsigned i = 0;
        for(auto _ : state) {
            T v{0};
            // Uncontended CAS of 0 -> 0 on successive elements; it always succeeds.
            n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
        }
        benchmark::DoNotOptimize(n);
    }
    
    // The __int128 instantiation needs -mcx16 to use cmpxchg16b.
    BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
    BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
    BENCHMARK_TEMPLATE(do_benchmark, __int128);
    BENCHMARK_MAIN();
    

    I ran it on a Coffee Lake i9-9900KS CPU.

    Results with gcc-8.3.0:

    $ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
    g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
    g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
    sudo cpupower frequency-set --related --governor performance >/dev/null
    /home/max/src/test/release/gcc/cmpxchg16b_benchmark
    2020-03-15 20:18:48
    Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
    Run on (16 X 5100 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x8)
      L1 Instruction 32 KiB (x8)
      L2 Unified 256 KiB (x8)
      L3 Unified 16384 KiB (x1)
    Load Average: 0.43, 0.40, 0.34
    ---------------------------------------------------------------------
    Benchmark                           Time             CPU   Iterations
    ---------------------------------------------------------------------
    do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
    do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
    do_benchmark<__int128>           6.35 ns         6.35 ns    110215116
    

    Results with clang-8.0.0:

    $ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
    clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
    clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
    sudo cpupower frequency-set --related --governor performance >/dev/null
    /home/max/src/test/release/clang/cmpxchg16b_benchmark
    2020-03-15 20:19:00
    Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
    Run on (16 X 5100 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x8)
      L1 Instruction 32 KiB (x8)
      L2 Unified 256 KiB (x8)
      L3 Unified 16384 KiB (x1)
    Load Average: 0.36, 0.39, 0.33
    ---------------------------------------------------------------------
    Benchmark                           Time             CPU   Iterations
    ---------------------------------------------------------------------
    do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
    do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
    do_benchmark<__int128>           5.99 ns         5.99 ns    116972653
    
    

    It looks like cmpxchg16b costs around 1.6x (clang) to 1.8x (gcc) as much as 8-byte cmpxchg on Intel Coffee Lake, while 4-byte and 8-byte cmpxchg cost the same.


    The same benchmark on a Ryzen 9 5950X with gcc-9.3.0:

    Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
    Run on (32 X 4889.51 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 512 KiB (x16)
      L3 Unified 32768 KiB (x2)
    Load Average: 1.11, 0.52, 0.33
    ---------------------------------------------------------------------
    Benchmark                           Time             CPU   Iterations
    ---------------------------------------------------------------------
    do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
    do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
    do_benchmark<__int128>           2.22 ns         2.22 ns    316143309
    
    

    On this AMD Ryzen 9, cmpxchg16b costs around 1.4x as much as 8-byte cmpxchg (2.22 ns vs 1.58 ns).
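
To make the arithmetic behind these ratios explicit, here is a minimal sketch that recomputes them from the timings reported above. The names `cost_ratio` and `print_ratios` are hypothetical helpers introduced only for illustration; the numbers are copied from the benchmark output and nothing is measured here.

```cpp
#include <cstdio>

// Hypothetical helper: how many times more a 16-byte CAS costs than an
// 8-byte CAS, given the per-iteration times in nanoseconds.
constexpr double cost_ratio(double ns_16_byte, double ns_8_byte) {
    return ns_16_byte / ns_8_byte;
}

// Prints the ratios for the three runs quoted in the answer.
// The timings are taken verbatim from the benchmark output above.
void print_ratios() {
    std::printf("Coffee Lake, gcc-8.3.0:    %.2fx\n", cost_ratio(6.35, 3.53));
    std::printf("Coffee Lake, clang-8.0.0:  %.2fx\n", cost_ratio(5.99, 3.84));
    std::printf("Ryzen 9 5950X, gcc-9.3.0:  %.2fx\n", cost_ratio(2.22, 1.58));
}
```

Calling `print_ratios()` from `main` should give roughly 1.80x, 1.56x and 1.41x, which matches the rounded figures quoted in the text. These are uncontended, L1-resident numbers, so they say nothing about behaviour under contention.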