I'm having a hard time interpreting Intel performance events reporting.
Consider the following simple program that mainly reads/writes memory:
#include <stdint.h>
#include <stdio.h>

volatile uint32_t a;
volatile uint32_t b;

int main() {
    printf("&a=%p\n&b=%p\n", &a, &b);
    for (size_t i = 0; i < 1000000000LL; i++) {
        a ^= (uint32_t) i;
        b += (uint32_t) i;
        b ^= a;
    }
    return 0;
}
I compile it with gcc -O2 and run it under perf:
# gcc -g -O2 a.c
# perf stat -a ./a.out
&a=0x55a4bcf5f038
&b=0x55a4bcf5f034
Performance counter stats for 'system wide':
32,646.97 msec cpu-clock # 15.974 CPUs utilized
374 context-switches # 0.011 K/sec
1 cpu-migrations # 0.000 K/sec
1 page-faults # 0.000 K/sec
10,176,974,023 cycles # 0.312 GHz
13,010,322,410 instructions # 1.28 insn per cycle
1,002,214,919 branches # 30.699 M/sec
123,960 branch-misses # 0.01% of all branches
2.043727462 seconds time elapsed
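(An aside about the numbers above: with -a, perf counts system-wide across all 16 logical CPUs, which is why cpu-clock is roughly 16x the elapsed time and the cycles rate looks like only 0.312 GHz; the mostly-idle CPUs accumulate cpu-clock but few cycles. Counting just this process gives per-process numbers:)

# perf stat ./a.out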
# perf record -a ./a.out
&a=0x5589cc1fd038
&b=0x5589cc1fd034
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.997 MB perf.data (9269 samples) ]
# perf annotate
The result of perf annotate (annotated for memory loads/stores by me):
Percent│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ xor %eax,%eax
│ nop
│ a ^= (uint32_t) i;
│28: mov a,%edx // 32-bit load
│ xor %eax,%edx
9.74 │ mov %edx,a // 32-bit store
│ b += (uint32_t) i;
12.12 │ mov b,%edx // 32-bit load
8.79 │ add %eax,%edx
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ add $0x1,%rax
│ b += (uint32_t) i;
18.69 │ mov %edx,b // 32-bit store
│ b ^= a;
0.04 │ mov a,%ecx // 32-bit load
22.39 │ mov b,%edx // 32-bit load
8.92 │ xor %ecx,%edx
19.31 │ mov %edx,b // 32-bit store
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ cmp $0x3b9aca00,%rax
│ ↑ jne 28
│ }
│ return 0;
│ }
│ xor %eax,%eax
│ add $0x8,%rsp
│ ← retq
My observations:

- insn per cycle is low (1.28), so I conclude that the program is mainly memory-bound.
- a and b appear to be located in the same cache line, adjacent to each other (the printed addresses are only 4 bytes apart).

My Question:

- Why is the percentage of the first memory load (mov a,%edx) zero?
- Why is the percentage of the load mov a,%ecx only 0.04%, while the one right next to it, mov b,%edx, gets 22.39%?

Notes:
OS: Linux 4.19.0-amd64, CPU: Intel Core i9-9900K, 100% idle system (also tested on i7-7700, same result).
Not exactly "memory" bound, but bound on latency of store-forwarding. i9-9900K and i7-7700 have exactly the same microarchitecture for each core so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for improvement in hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)
Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples over very small scales.
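(A side note, not something the question used: sampling with a precise PEBS event reduces that skid. You ask for it by adding precise-level modifiers to the event; exactly which precision level is available depends on the CPU and kernel:)

# perf record -e cycles:ppp ./a.out
# perf annotate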
Perf never blames the load that was slow to produce a result; it usually blames the instruction that was waiting for it (in this case an xor or add), and here sometimes the store consuming the result of that xor. These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try to reload too soon: Loop with function call faster than an empty loop), so you do have loads completing at about 2 per 3 to 5 cycles.
You have two dependency chains through memory:

- b: loaded, modified and stored twice per iteration (b += (uint32_t) i; then b ^= a;). This chain is twice as long and will be the overall bottleneck for the loop.
- a: loaded, modified and stored once per iteration (with an extra read each iteration, which can happen in parallel with the read that's part of the next a ^= i;).

The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.
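(Rough sanity check, assuming nearly all of the cycles counted by the system-wide perf stat above come from the one busy core on this otherwise idle machine: 10.18e9 cycles / 1.0e9 iterations ≈ 10 cycles per iteration, which lines up with two serial store-forward round trips of roughly 5 cycles each through b per iteration, with everything else hidden under that latency.)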
I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to ports according to oldest-ready-first: How are x86 uops scheduled, exactly?)
Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)
The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (see footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect what store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.
Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.
Footnote 1: Your annotated output only shows a, not a(%rip), so that's odd; but it doesn't matter whether you somehow got it to use 32-bit absolute addressing, or whether it's just a disassembly quirk failing to show RIP-relative.
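(If you want to see the addressing mode the compiler actually emitted, rather than perf annotate's symbolic form, plain objdump shows it; the displacement will of course differ per build:)

# objdump -d ./a.out

A RIP-relative load of a disassembles as something like mov 0x...(%rip),%edx, with objdump adding a comment that resolves the target to a.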