c assembly optimization cpu-architecture microbenchmark

Why can't my program reach the integer addition instruction throughput bound?

I have read chapter 5 of CSAPP 3e, and I want to test whether the optimization techniques described in the book work on my computer. I wrote the following program:

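#include <stdio.h>      /* printf */
#include <stdlib.h>     /* malloc */
#include <x86intrin.h>  /* __rdtsc */

#define SIZE (1024)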
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE; ++i) {
    sum += array[i];
  }
  unsigned long long after = __rdtsc();
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

It reports a CPE of around 1.00.

I then transformed the program using the 4x4 loop unrolling technique, which gives the following program:

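#include <stdio.h>      /* printf */
#include <stdlib.h>     /* malloc */
#include <x86intrin.h>  /* __rdtsc */

#define SIZE (1024)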
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);

  int sum0 = 0;
  int sum1 = 0;
  int sum2 = 0;
  int sum3 = 0;
  /* 4x4 unrolling */
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE; i += 4) {
    sum0 += array[i];
    sum1 += array[i + 1];
    sum2 += array[i + 2];
    sum3 += array[i + 3];
  }
  unsigned long long after = __rdtsc();
  sum = sum0 + sum1 + sum2 + sum3;
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

Note that I omitted the code that handles the case where SIZE is not a multiple of 4. This program reports a CPE of around 0.80.

My program runs on an AMD 5950X. According to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and a throughput of 4 instructions per cycle. The load-store unit can execute three independent loads per cycle. I therefore expected a CPE of about 0.33, and I do not know why the measured result is so much higher.

My compiler is gcc 12.2.0. All programs are compiled with the -Og flag.

I checked the assembly code of the optimized program, but found nothing helpful:

.L4:
        movslq  %r9d, %rcx
        addl    (%r8,%rcx,4), %r11d
        addl    4(%r8,%rcx,4), %r10d
        addl    8(%r8,%rcx,4), %ebx
        addl    12(%r8,%rcx,4), %esi
        addl    $4, %r9d
.L3:
        cmpl    $127, %r9d
        jle     .L4

I assume at least 3 of the 4 addl instructions should execute in parallel. However, the result of the program does not meet my expectation.

Solution

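cmpl $127, %r9d means the loop stops once the index passes 127, i.e. only 32 trips through the unrolled body. That is not a large iteration count compared to rdtsc overhead, the branch mispredict when you exit the loop, and the time for the CPU to ramp up to max frequency.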
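As a rough illustration of how a fixed cost distorts a short measurement, the sketch below models the reported CPE as the steady-state CPE plus a one-time overhead spread over the element count. The function names are hypothetical, the 128-element count is inferred from the cmpl $127 in the question's asm, and the 0.33 steady-state figure is the asker's expectation rather than a measurement. TSC ticks are also not core clock cycles, so treat the result as an order-of-magnitude estimate only.

```c
#include <assert.h>

/* Hypothetical helper: fixed overhead (in timer ticks) that would explain
 * the gap between a measured CPE and an assumed steady-state CPE. */
double implied_fixed_overhead(double measured_cpe, double steady_cpe,
                              double n_elems)
{
    assert(n_elems > 0.0);
    return (measured_cpe - steady_cpe) * n_elems;
}

/* Hypothetical helper: CPE a timed region would report if a fixed
 * overhead is amortized over n_elems elements. */
double reported_cpe(double steady_cpe, double overhead, double n_elems)
{
    assert(n_elems > 0.0);
    return steady_cpe + overhead / n_elems;
}
```

With the question's numbers, implied_fixed_overhead(0.80, 0.33, 128) is about 60 ticks, which is a plausible size for rdtsc overhead plus one branch mispredict. Under the same assumption, reported_cpe(0.33, 60.0, 1e8) differs from 0.33 by less than a millionth, which is why a much larger total iteration count makes the fixed cost disappear from the result.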

Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)

See Idiomatic way of performance evaluation?

10M to 1000M total iterations is appropriate, since that's still under a second and we only want to measure steady-state behaviour, not cold-cache effects, cold-branch-predictor effects, or page faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to only count user-space cycles and instructions.

If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an and to wrap an index.
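The "use an and to wrap an index" variant might look like the sketch below. It is a minimal sketch, not a tuned benchmark: the names ELEMS, make_array and sum_wrapped are hypothetical, the 4096-element (16 KiB) size is assumed to fit in L1d, and the accumulators are unsigned (unlike the question's int) so that wraparound is well-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 4096 * 4 bytes = 16 KiB. Must be a power of two so the AND mask wraps. */
#define ELEMS 4096u

/* Allocate the array and write every element, so the pages are touched
 * (no page faults) before the timed loop. Returns NULL on failure. */
uint32_t *make_array(void)
{
    uint32_t *a = malloc(ELEMS * sizeof *a);
    if (a != NULL) {
        for (size_t i = 0; i < ELEMS; ++i)
            a[i] = 1;
    }
    return a;
}

/* Sum `total` elements with four independent accumulators, wrapping the
 * index with AND so every access stays inside the small array.
 * `total` must be a multiple of 4 (no cleanup loop, as in the question). */
uint32_t sum_wrapped(const uint32_t *array, uint64_t total)
{
    assert(total % 4 == 0);
    uint32_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (uint64_t i = 0; i < total; i += 4) {
        size_t j = (size_t)(i & (ELEMS - 1)); /* multiple of 4, <= ELEMS-4 */
        sum0 += array[j];
        sum1 += array[j + 1];
        sum2 += array[j + 2];
        sum3 += array[j + 3];
    }
    return sum0 + sum1 + sum2 + sum3;
}
```

One way to use it is to call sum_wrapped(a, 1000000000) from main, print the result so the loop cannot be optimized away, and run the binary under perf stat --all-user. Dividing the reported cycles by the element count then approximates the CPE. Compiling with -Og as in the question keeps the scalar add-from-memory form; higher optimization levels may auto-vectorize the loop and measure something else.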

Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.

When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)

See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.

They also tested with an indexed addressing mode. With "unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)