Core latency testing ARMv8.1

There is an interesting article about AWS's ARMv8.1-based Graviton2 offering. The article includes CPU cache-coherency tests, which I am trying to reproduce.

The tests use a C++ repository on GitHub named core-latency, which is built on the Nonius micro-benchmarking framework.

I managed to replicate the first test (the one without atomic instructions) by compiling with the command below:

$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8-a

The article claims that code built for ARMv8.1 uses atomic CAS operations and performs much better, and it provides test results showing the improvement.

I tried to reproduce this by compiling for ARMv8.1, ARMv8.2, and ARMv8.3. Sample compilation commands are below:

$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.1-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.2-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -o core-latency main.cpp -march=armv8.3-a+lse

None of these improved the performance, so I generated the assembly code using these commands:

$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.1-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.2-a+lse
$ g++ -std=c++11 -Wall -pthread -O3 -Iinclude -S main.cpp -march=armv8.3-a+lse

I searched the generated assembly and could not find any CAS instructions. I also tried different variations of the compilation, with and without "lse" and "-moutline-atomics".

I am not a C++ expert and I have a very basic understanding of it.

My guess is that the code needs some changes to use atomic instructions.

The tests are executed on an m6g.16xlarge EC2 instance in AWS running Ubuntu 20.04.

If someone could check the core-latency code and suggest how to make it compile to CAS instructions, that would be a great help.

Solution

After some more experiments, I found the problem. The code snippet below performs two separate steps:

first, a plain comparison (if state equals Ping)
then a call to the class method set, which performs an atomic store.

Code snippet from core-latency:

if (state == Ping)
   sync.set(Pong);
...
void set(State new_state)
{
  state.store(new_state);
}

This code never compiles to a CAS instruction, because the comparison and the store are two separate operations. If you want an atomic compare-and-swap operation, you need to use the relevant method of std::atomic.
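To make the difference concrete, here is a minimal sketch that puts the two patterns side by side. The function names set_pong_separately and set_pong_with_cas are hypothetical and not part of core-latency; only the State, Ping and Pong names are taken from the snippet above, and the enum definition itself is an assumption.

```cpp
#include <atomic>

// Assumed definition; the real enum in core-latency may differ.
enum State { Ping, Pong };

// Hypothetical helper mirroring the benchmark's pattern: an atomic load and
// comparison, followed by a separate atomic store. Another thread could
// change `state` between the two steps.
bool set_pong_separately(std::atomic<State>& state) {
  if (state.load() == Ping) {
    state.store(Pong);
    return true;
  }
  return false;
}

// Hypothetical helper doing the same check and update as one atomic
// compare-and-swap. This is the form a compiler can map to a CAS instruction
// when the target supports it.
bool set_pong_with_cas(std::atomic<State>& state) {
  State expected = Ping;
  return state.compare_exchange_strong(expected, Pong);
}
```

Compiling a file like this with -S and comparing the output for the two functions should show where the generated code differs.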

I wrote the sample code below to experiment:

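#include <atomic>
#include <cstdio>

int main() {
  int expected = 0;
  int desired = 1;
  std::atomic<int> current;
  current.store(expected);
  printf("Before %d\n", current.load());
  // Atomically replace `expected` with `desired`. The weak form may fail
  // spuriously, so it is retried in a loop; on failure it writes the
  // observed value back into `expected`.
  while (!current.compare_exchange_weak(expected, desired)) {
  }
  printf("After %d\n", current.load());
  return 0;
}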

I compiled it for ARMv8.1 and can see that it uses the CAS instruction. Compiled for ARMv8.0, it does not use CAS, which is expected because the instruction is not available in that version.

So to get CAS instructions, I need to use std::atomic's compare_exchange_weak or compare_exchange_strong; otherwise, the compiler will not emit CAS and will compile the comparison and the store as separate operations.

In summary, I can rewrite the benchmark with compare_exchange_weak and see what results I get.
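A rough sketch of what such a rewrite could look like is below. It is not the code from core-latency: the helpers flip and ping_pong are hypothetical names, the State enum is assumed, and timing is left out entirely (in the real benchmark that is the job of the Nonius harness). It only shows two threads handing a flag back and forth with compare_exchange_weak.

```cpp
#include <atomic>
#include <thread>

// Assumed definition; the real enum in core-latency may differ.
enum State { Ping, Pong };

// Hypothetical helper: spin until `state` equals `from`, then atomically
// flip it to `to` with a single compare-and-swap.
void flip(std::atomic<State>& state, State from, State to) {
  State expected = from;
  while (!state.compare_exchange_weak(expected, to)) {
    expected = from;  // a failed CAS overwrites `expected` with the observed value
  }
}

// Hypothetical helper: bounce the flag between two threads `round_trips`
// times and return the final state.
State ping_pong(int round_trips) {
  std::atomic<State> state(Ping);
  // The responder thread turns every Ping into a Pong.
  std::thread responder([&] {
    for (int i = 0; i < round_trips; ++i) flip(state, Ping, Pong);
  });
  // The calling thread turns every Pong back into a Ping.
  for (int i = 0; i < round_trips; ++i) flip(state, Pong, Ping);
  responder.join();
  return state.load();
}
```

Because each flip can only succeed after the other thread's flip, the two threads strictly alternate, and the flag ends up back at Ping once both loops finish. Like the benchmark itself, this needs -pthread when compiling.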

New update April 30

I have created a new version of the code with atomic compare-and-swap support. It is available at https://github.com/fuatu/core-latency-atomic

Here are the test results for instance m6g.16xlarge (ARM):

Without CAS: Average latency 245ns

With CAS: Average latency 39ns