I'm having a hard time interpreting Intel performance events reporting.
Consider the following simple program that mainly reads/writes memory:
#include <stdint.h>
#include <stdio.h>

volatile uint32_t a;
volatile uint32_t b;

int main() {
    printf("&a=%p\n&b=%p\n", &a, &b);
    for (size_t i = 0; i < 1000000000LL; i++) {
        a ^= (uint32_t) i;
        b += (uint32_t) i;
        b ^= a;
    }
    return 0;
}
I compile it with gcc -O2 and run it under perf:
# gcc -g -O2 a.c
# perf stat -a ./a.out
&a=0x55a4bcf5f038
&b=0x55a4bcf5f034
Performance counter stats for 'system wide':
32,646.97 msec cpu-clock # 15.974 CPUs utilized
374 context-switches # 0.011 K/sec
1 cpu-migrations # 0.000 K/sec
1 page-faults # 0.000 K/sec
10,176,974,023 cycles # 0.312 GHz
13,010,322,410 instructions # 1.28 insn per cycle
1,002,214,919 branches # 30.699 M/sec
123,960 branch-misses # 0.01% of all branches
2.043727462 seconds time elapsed
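(An aside about the numbers above: with -a, perf counts system-wide across all 16 logical CPUs, which is why cpu-clock is roughly 16x the elapsed time and the cycles rate looks like only 0.312 GHz; the mostly-idle CPUs accumulate cpu-clock but few cycles. Counting just this process gives per-process numbers:)

# perf stat ./a.out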
# perf record -a ./a.out
&a=0x5589cc1fd038
&b=0x5589cc1fd034
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.997 MB perf.data (9269 samples) ]
# perf annotate
The result of perf annotate (annotated for memory loads/stores by me):
Percent│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ xor %eax,%eax
│ nop
│ a ^= (uint32_t) i;
│28: mov a,%edx // 32-bit load
│ xor %eax,%edx
9.74 │ mov %edx,a // 32-bit store
│ b += (uint32_t) i;
12.12 │ mov b,%edx // 32-bit load
8.79 │ add %eax,%edx
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ add $0x1,%rax
│ b += (uint32_t) i;
18.69 │ mov %edx,b // 32-bit store
│ b ^= a;
0.04 │ mov a,%ecx // 32-bit load
22.39 │ mov b,%edx // 32-bit load
8.92 │ xor %ecx,%edx
19.31 │ mov %edx,b // 32-bit store
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ cmp $0x3b9aca00,%rax
│ ↑ jne 28
│ }
│ return 0;
│ }
│ xor %eax,%eax
│ add $0x8,%rsp
│ ← retq
My observations:

- insn per cycle is low (1.28), so I conclude that the program is mainly memory-bound.
- a and b appear to be located in the same cache line, adjacent to each other (the printed addresses are only 4 bytes apart).

My Question:

- Why is the percentage of the first memory load (mov a,%edx) zero?
- Why is the percentage of the load mov a,%ecx only 0.04%, while the one right next to it, mov b,%edx, gets 22.39%?

Notes:
OS: Linux 4.19.0-amd64, CPU: Intel Core i9-9900K, 100% idle system (also tested on i7-7700, same result).
Not exactly "memory" bound, but bound on latency of store-forwarding. i9-9900K and i7-7700 have exactly the same microarchitecture for each core so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for improvement in hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)
Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples over very small scales.
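(A side note, not something the question used: sampling with a precise PEBS event reduces that skid. You ask for it by adding precise-level modifiers to the event; exactly which precision level is available depends on the CPU and kernel:)

# perf record -e cycles:ppp ./a.out
# perf annotate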
Perf never blames the load that was slow to produce a result; it usually blames the instruction that was waiting for it (in this case an xor or add), and here sometimes the store consuming the result of that xor. These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try to reload too soon: Loop with function call faster than an empty loop), so you do have loads completing at about 2 per 3 to 5 cycles.
You have two dependency chains through memory:

- b: loaded, modified and stored twice per iteration (b += (uint32_t) i; then b ^= a;). This chain is twice as long and will be the overall bottleneck for the loop.
- a: loaded, modified and stored once per iteration (with an extra read each iteration, which can happen in parallel with the read that's part of the next a ^= i;).

The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.
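(Rough sanity check, assuming nearly all of the cycles counted by the system-wide perf stat above come from the one busy core on this otherwise idle machine: 10.18e9 cycles / 1.0e9 iterations ≈ 10 cycles per iteration, which lines up with two serial store-forward round trips of roughly 5 cycles each through b per iteration, with everything else hidden under that latency.)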
I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to ports according to oldest-ready-first: How are x86 uops scheduled, exactly?)
Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)
The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (see footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect what store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.
Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.
Footnote 1: Your annotated output only shows a, not a(%rip), so that's odd; but it doesn't matter whether you somehow got it to use 32-bit absolute addressing, or whether it's just a disassembly quirk failing to show RIP-relative.
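(If you want to see the addressing mode the compiler actually emitted, rather than perf annotate's symbolic form, plain objdump shows it; the displacement will of course differ per build:)

# objdump -d ./a.out

A RIP-relative load of a disassembles as something like mov 0x...(%rip),%edx, with objdump adding a comment that resolves the target to a.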