Tags: performance, assembly, x86, cpu-architecture, micro-optimization

ADD slower than ADC in the first step of a bigint multiply on Coffee Lake (Skylake)


Changing add to adc in the highlighted line below significantly improves performance. I find this very counter-intuitive, since add can execute on more ports and does not depend on flags.

CPU: Intel i7-9750H (Coffee Lake).
UOPS_ISSUED.ANY for add = ~2.87 uops/cycle.
UOPS_ISSUED.ANY for adc = ~3.47 uops/cycle.
Retire slots are at 98.5% of uops issued in both cases.

And it's reflected in benchmark times: the add version is a lot slower.

I would appreciate it if someone could help me understand why add is slower. I can provide more metrics; I just don't know what to look for.

# Code to multiply large integer by a qword.
# RSI = input large integer (qword array).
# RDI = output large integer (qword array).
# RDX = qword to multiply the large integer by.
# ECX = number of 32-byte blocks to process (i.e. qwords count / 4).
# RAX = carry

xor eax, eax

.p2align 4
L_mul_word_loop:
  mulx r9, r8, [rsi]         # r9:r8 = RDX * [rsi]
  add r8, rax         ###  <-- "adc r8, rax" (+20-30% performance)
  mov [rdi], r8

  mulx rax, r8, [rsi+8]      # rax:r8 = RDX * [rsi+8]
  adc r9, r8                 # add high-half of last mulx to low half of this
  mov [rdi+8], r9            # and store.

  mulx r9, r8, [rsi+16]
  adc r8, rax
  mov [rdi+16], r8

  mulx rax, r8, [rsi+24]
  adc r9, r8
  mov [rdi+24], r9

  adc rax, 0                # include CF into high half of last mulx: can't wrap
  add rdi, 32
  add rsi, 32

  dec ecx
  jnz L_mul_word_loop

The loop-carried dependency chain runs through CF between the add -> adc instructions, and through RAX from the bottom of the loop back to the top. (The largest possible product has a high half less than 0xFF..., so the final adc rax, 0 can't wrap and produce its own carry-out; e.g. 0xFF * 0xFF = 0xFE01.) This dependency chain is 5 cycles long.
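
Spelled out, the five single-cycle operations on that chain are:

  add r8, rax       # uses RAX produced at the bottom of the previous iteration
  adc r9, r8        # CF in, CF out
  adc r8, rax       # CF in, CF out
  adc r9, r8        # CF in, CF out
  adc rax, 0        # folds CF into the high half; writes RAX for the next iteration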

(Experimentally, per the comments, using lea instead of add for the pointer increments and removing the adc rax, 0 to shorten the chain to 4 cycles isn't faster than this version with adc at the top of the loop.)
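
That experiment would look something like this (a sketch; the label is mine, and it relies on lea and dec leaving CF intact so the carry out of the last adc survives around the loop; the initial xor eax, eax also clears CF before entry):

.p2align 4
L_mul_word_loop_4c:
  mulx r9, r8, [rsi]
  adc r8, rax                # consume last iteration's CF along with RAX
  mov [rdi], r8

  mulx rax, r8, [rsi+8]
  adc r9, r8
  mov [rdi+8], r9

  mulx r9, r8, [rsi+16]
  adc r8, rax
  mov [rdi+16], r8

  mulx rax, r8, [rsi+24]
  adc r9, r8                 # CF out of this adc flows to the next iteration
  mov [rdi+24], r9

  lea rdi, [rdi+32]          # lea, unlike add, does not write CF
  lea rsi, [rsi+32]

  dec ecx                    # dec writes ZF but preserves CF
  jnz L_mul_word_loop_4c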


Minimal reproducible code: add-adc.asm

Full code on GitHub (only tested on Mac; an older version was working on Ubuntu, although I haven't checked in a while). I'm observing this effect with this test:

build/benchmark_asm_x86_64 --benchmark_filter=mul_word/100 --benchmark_repetitions=3 --benchmark_report_aggregates_only=true

Update

I added a minimal reproducible example link (a single ASM file) and tried to gather per-iteration metrics. Note that the outer-loop iteration count is much higher than the inner-loop count (I want to keep the arrays small enough for L1 cache); hopefully that does not skew the numbers too much. Here's what I got:

add per inner loop iteration:
~ 6.88 cycles
~ 20.00 uops_issued.any
~ 4.83 uops_dispatched_port.port1

adc per inner loop iteration:
~ 6.12 cycles
~ 20.11 uops_issued.any
~ 4.34 uops_dispatched_port.port1
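
For anyone reproducing this on Linux, such counts can be collected with perf along these lines and divided by the total inner-loop iteration count (the event names are the ones used in the answer below; exact names vary by CPU):

perf stat -e cycles,uops_issued.any,uops_dispatched_port.port_1 \
    build/benchmark_asm_x86_64 --benchmark_filter=mul_word/100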


Solution

  • I think this is caused by suboptimal scheduling, taking an extra hit from writeback conflict penalties.

    Comparing uops.info measurements for 64-bit mulx vs. variants of imul, we can reasonably conclude that mulx comprises one port 1 uop with latency 3 producing the low 64 bits of the result, plus one port 5 uop producing the high 64 bits one cycle later.
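
    In terms of the loop above, that breaks down roughly as follows (the split into uops is inferred from those measurements, not documented by Intel):

        mulx r9, r8, [rsi]   # memory operand: one load uop, plus
                             #   one p1 uop, 3-cycle latency -> r8 (low 64 bits)
                             #   one p5 uop, result one cycle later -> r9 (high 64 bits)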

    Consider what happens when a 3-cycle uop is issued on port 1 at cycle i and a 1-cycle uop is pending. The pending uop cannot be issued at cycle i (the port cannot accept a second uop on that cycle). It also cannot be issued at cycle i+2: that would cause a writeback conflict at cycle i+3 on that port (a port cannot produce results for two uops on the same cycle).
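
    As a timeline:

        cycle i   : 3-cycle uop issues on port 1  -> writes back at i+3
        cycle i+1 : 1-cycle uop may issue         -> writes back at i+2 (fine)
        cycle i+2 : 1-cycle uop must not issue    -> would write back at i+3 (conflict)
        cycle i+3 : 1-cycle uop may issue again   -> writes back at i+4 (fine)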

    By constructing a loop with forced writeback conflicts, Fabian Giesen demonstrated there's apparently an extra penalty when that happens.

    Hence, if you have a mix of single-cycle and multi-cycle uops competing for the same port, it's a bit of a landmine. My initial attempt at solving this was to add a fifth mulx instruction to the loop (whose outputs would simply be discarded), so that there is constant pressure from multi-cycle uops on p1 and p5 and the adds get scheduled elsewhere; a sketch follows below. That worked to some degree, but it's possible to do even better!
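
    The fifth-mulx idea would look roughly like this (a sketch; the scratch registers are my choice, and their values are never read):

        mulx r11, r10, [rsi]   # dummy multiply: r11:r10 are dead, the uops exist
                               # only to keep p1 and p5 occupied every cycle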

    The following measurements are from Skylake (i7-6700).

    The baseline version runs at ~6.9 cycles per iteration:

        10,337,227,178      cycles:u
         4,283,434,673       uops_dispatched_port.port_0:u
         7,278,477,707       uops_dispatched_port.port_1:u
         6,849,168,521       uops_dispatched_port.port_5:u
         5,609,252,055       uops_dispatched_port.port_6:u
    
        10,384,026,044      cycles:u
         3,922,820,702       uops_dispatched_port.port_2:u
         4,024,756,546       uops_dispatched_port.port_3:u
         6,388,517,494       uops_dispatched_port.port_4:u
         4,087,151,242       uops_dispatched_port.port_7:u
    

    Note that the number of store-data uops on port 4 is more than 5% higher than it should be. Some stores are being replayed, and this will have an increasingly confounding effect on the following measurements.

    The variant with adc in place of the initial add runs at 6 cycles per iteration:

         8,877,983,794      cycles:u
         5,181,755,017       uops_dispatched_port.port_0:u
         6,525,640,349       uops_dispatched_port.port_1:u
         6,019,129,311       uops_dispatched_port.port_5:u
         6,295,528,774       uops_dispatched_port.port_6:u
    
         9,040,426,883      cycles:u
         3,762,317,354       uops_dispatched_port.port_2:u
         3,814,931,097       uops_dispatched_port.port_3:u
         7,292,924,631       uops_dispatched_port.port_4:u
         4,462,674,038       uops_dispatched_port.port_7:u
    

    (note the even higher count on port 4)

    Now let's swap the increments of rsi and rdi (rsi is needed earlier, by the loads), so the loop tail becomes:
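
        adc rax, 0
        add rsi, 32          # moved ahead of the rdi increment
        add rdi, 32
        dec ecx
        jnz L_mul_word_loop

    This brings a very noticeable improvement, to 5.6 cycles per iteration: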

         8,376,301,855      cycles:u
         5,129,834,339       uops_dispatched_port.port_0:u
         6,632,790,174       uops_dispatched_port.port_1:u
         6,088,383,045       uops_dispatched_port.port_5:u
         6,173,097,806       uops_dispatched_port.port_6:u
    
         8,404,972,940      cycles:u
         4,287,284,508       uops_dispatched_port.port_2:u
         4,317,891,165       uops_dispatched_port.port_3:u
         7,408,432,079       uops_dispatched_port.port_4:u
         3,429,913,047       uops_dispatched_port.port_7:u
    

    (the port 4 count goes higher yet again)

    Now, if we think that single-cycle uops issuing on ports 1 and 5 are causing grief for the scheduler, can we figure out a way to force them onto other ports somehow? Yes! A macro-fused, predicted non-taken add/jcc pair can issue only on ports 0 and 6. So let's add jz .+2 after add rsi, 32 (note that the Skylake JCC erratum could bite me here, but luckily our loop starts at 16 bytes mod 32, so we avoid the problematic crossing):
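
        add rsi, 32
        jz .+2               # never taken; macro-fuses with the add, and the
                             # fused pair can issue only on ports 0 and 6

    With that change, the measurements are: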

         8,339,935,074      cycles:u
         4,749,784,934       uops_dispatched_port.port_0:u
         6,429,206,233       uops_dispatched_port.port_1:u
         6,045,479,355       uops_dispatched_port.port_5:u
         6,798,134,405       uops_dispatched_port.port_6:u
    
         8,330,518,755      cycles:u
         4,250,791,971       uops_dispatched_port.port_2:u
         4,284,637,776       uops_dispatched_port.port_3:u
         7,379,587,531       uops_dispatched_port.port_4:u
         3,503,715,123       uops_dispatched_port.port_7:u
    

    Oh no, our cycles barely budged! What if we finally bite the bullet and nop out the first store (nop dword ptr [rdi] in place of mov [rdi], r8):

         7,912,840,444      cycles:u
         4,648,297,975       uops_dispatched_port.port_0:u
         6,334,248,230       uops_dispatched_port.port_1:u
         6,059,133,113       uops_dispatched_port.port_5:u
         6,982,987,722       uops_dispatched_port.port_6:u
    
         7,843,194,695      cycles:u
         3,810,009,654       uops_dispatched_port.port_2:u
         3,790,523,332       uops_dispatched_port.port_3:u
         6,094,690,751       uops_dispatched_port.port_4:u
         2,938,959,057       uops_dispatched_port.port_7:u
    

    That's 5.25 cycles per iteration. Applying the same jz .+2 trick to add rdi, 32, we get:

         7,630,926,523      cycles:u
         5,831,880,767       uops_dispatched_port.port_0:u
         6,004,354,504       uops_dispatched_port.port_1:u
         6,002,011,214       uops_dispatched_port.port_5:u
         6,183,831,282       uops_dispatched_port.port_6:u
    
         7,638,112,528      cycles:u
         3,751,406,178       uops_dispatched_port.port_2:u
         3,695,775,382       uops_dispatched_port.port_3:u
         5,638,534,896       uops_dispatched_port.port_4:u
         3,091,934,468       uops_dispatched_port.port_7:u
    

    And that's below 5.1 cycles per iteration, against a predicted throughput of 5 cycles per iteration. We can see that ports 1 and 5 are now occupied only by the mulx instructions. Restoring the nopped-out store, I get 5.36 cycles per iteration:

         8,041,908,278      cycles:u
         5,599,216,700       uops_dispatched_port.port_0:u
         6,010,109,267       uops_dispatched_port.port_1:u
         6,002,448,325       uops_dispatched_port.port_5:u
         6,417,804,937       uops_dispatched_port.port_6:u
    
         8,050,871,976      cycles:u
         4,183,403,858       uops_dispatched_port.port_2:u
         4,166,022,265       uops_dispatched_port.port_3:u
         7,179,871,612       uops_dispatched_port.port_4:u
         3,695,465,024       uops_dispatched_port.port_7:u
    

    It's unclear to me what's causing the replays.
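
    Putting it all together, the final 5.36 c/it variant (with the store restored) ends up looking like this; this is a reconstruction from the steps described above, not a verbatim listing:

        .p2align 4
        L_mul_word_loop:
          mulx r9, r8, [rsi]
          adc r8, rax           # adc instead of add: CF is known to be 0 here
          mov [rdi], r8

          mulx rax, r8, [rsi+8]
          adc r9, r8
          mov [rdi+8], r9

          mulx r9, r8, [rsi+16]
          adc r8, rax
          mov [rdi+16], r8

          mulx rax, r8, [rsi+24]
          adc r9, r8
          mov [rdi+24], r9

          adc rax, 0            # can't wrap, so this leaves CF = 0
          add rsi, 32           # rsi first: it is needed earlier by the loads
          jz .+2                # never taken; fused pair goes to p0/p6
          add rdi, 32
          jz .+2                # same trick for the other increment
          dec ecx               # preserves CF for the adc at the top
          jnz L_mul_word_loop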