assembly arm cpu-architecture arm64

The pipeline of add with lsl >4 in Neoverse N1


I have a question about which pipeline an adds with a shifted operand uses on Neoverse N1, specifically adds x3, x4, x5, lsl #32.

According to the Neoverse N1 Software Optimization Guide (https://developer.arm.com/documentation/pjdoc466751330-9707/latest/), the instruction is supposed to use the M pipeline.

However, experiments on Graviton 2 (which is Neoverse N1) suggest that it does not.

To check this, I wrote code that saturates the M pipeline:

.rep 1000
    mul x7, x8, x9
.endr

I observe an average of 3.117 cycles per mul, which is consistent with the guide's description of MADD (mul is an alias of madd with xzr as the addend).

If I add adds x3, x4, x5, lsl #32 as shown here, the cycle count should increase, because the M pipeline is already saturated:

.rep 1000
    mul x7, x8, x9
    adds x3, x4, x5, lsl #32
.endr

However, the observed cycle count is still 3.13, which implies that mul and adds can run in parallel.
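
The expectation being tested here can be written down as simple arithmetic. The sketch below is a hypothetical model, not something taken from the Arm guide: it assumes the mul occupies its resource for 3 cycles (from the 1/3 throughput) and the adds needs one issue slot, and it compares a pipe that mul blocks completely against one where the adds can overlap with the mul. All function and constant names are made up for illustration.

```python
# Hypothetical occupancy model; the names and the 3-cycle / 1-cycle figures
# are assumptions for illustration, not values from a simulator.
MUL_OCCUPANCY = 3   # cycles, from the 1/3 throughput of 64-bit mul
ADDS_OCCUPANCY = 1  # cycles, one issue slot

def cycles_if_serialized(mul_cycles, adds_cycles):
    """Both instructions queue on one pipe that mul blocks completely."""
    return mul_cycles + adds_cycles

def cycles_if_overlapped(mul_cycles, adds_cycles):
    """adds issues while the multiplier is busy, so the longer one dominates."""
    return max(mul_cycles, adds_cycles)

serialized = cycles_if_serialized(MUL_OCCUPANCY, ADDS_OCCUPANCY)
overlapped = cycles_if_overlapped(MUL_OCCUPANCY, ADDS_OCCUPANCY)
observed = 3.13  # measured cycles per mul + adds pair, from the experiment

# Pick the prediction closest to the measurement.
closest = min(
    ((serialized, "serialized"), (overlapped, "overlapped")),
    key=lambda pair: abs(pair[0] - observed),
)[1]
print(f"serialized={serialized} overlapped={overlapped} closest={closest}")
```

Under these assumptions the serialized case predicts 4 cycles and the overlapped case predicts 3, so the measured 3.13 is much closer to the overlapped case.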

I suspect that the adds instruction is actually using the ‘I’ pipeline. The following experiment indirectly supports this:

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0 // Three adds: saturates I pipeline
.endr

The average cycle count is 1.14.

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0
    adds x2, x3, x4, lsl #32
.endr

The average cycle count is 1.51, so it increased!


Solution

  • Apparently it can only start a new 64-bit mul x7, x8, x9 every 3 cycles (listed throughput 1/3), but you found that this doesn't stop that pipe from starting a simpler operation in one of the other two cycles.

    There's probably a separate execution unit for shifts and adds, so when the footnote says "4. X-form multiply accumulates stall the multiplier pipeline for 2 extra cycles." apparently they mean the actual multiply execution unit, not other execution units with different latency on the same port. The manual doesn't say that (clearly or at all), but it would make sense.

    I wonder if a shift count of lsl #32 is special in some way; maybe try lsl #27 or asr #27, some odd number that's not half a register. Or try a shift count from a register, although AArch64 adds only accepts an immediate shift amount, so that would need a separate shift instruction. But if adds x3, x4, x5, lsl #32 could run on any of the three I ports, you'd see 3/clock throughput for it on its own.


    Look at the diagram in section 2.2 of the document: there are only three total pipes for integer instructions; one of the single-cycle I pipes is also the integer multi-cycle pipe, so it's expected that adds x2, x3, x4, lsl #32 competes for throughput against plain add unshifted.

    M isn't separate from I; it's a restriction to one of the three I pipes. This is like Intel CPUs before Haswell: three integer execution ports, p0, p1, and p5, but only one of them (p1) is capable of running multi-cycle-latency integer uops like imul or popcnt. Listing a uop as p015 vs. p1, as Agner Fog's instruction tables or https://uops.info/ do, makes this clearer than "I" vs. "M", but it's the same thing, except that Intel CPUs also have SIMD and branch execution units on those same ports, while Neoverse puts those on separate pipelines, like AMD does for SIMD.
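
The port-set view (p015 vs. p1) can be turned into a rough throughput bound. The following is a minimal sketch under stated assumptions, not a model of the real scheduler: the pipe names I0, I1 and M are hypothetical labels for the three integer pipes, every uop is assumed to take exactly one issue slot, and dependencies, the front end and the multiplier's extra occupancy are ignored. For each subset of pipes, the uops that can only issue on pipes in that subset need at least count / size cycles, and the bound is the maximum over all subsets.

```python
from itertools import combinations

PIPES = ("I0", "I1", "M")  # hypothetical names for the three integer pipes

def min_cycles(uops, pipes=PIPES):
    """Lower bound on cycles per iteration from port pressure alone.

    uops: list of sets, each holding the pipes that uop may issue on.
    Every uop is assumed to consume exactly one issue slot.
    """
    best = 0.0
    for size in range(1, len(pipes) + 1):
        for subset in combinations(pipes, size):
            allowed_here = set(subset)
            # uops that have no choice but to use a pipe from this subset
            confined = sum(1 for allowed in uops if allowed <= allowed_here)
            best = max(best, confined / size)
    return best

ANY = {"I0", "I1", "M"}  # plain add: any integer pipe (like p015)
M_ONLY = {"M"}           # adds with lsl #32: M pipe only (like p1)

three_adds = [ANY] * 3
three_adds_plus_shifted = [ANY] * 3 + [M_ONLY]

print(f"{min_cycles(three_adds):.2f}")               # three plain adds
print(f"{min_cycles(three_adds_plus_shifted):.2f}")  # plus one shifted adds
print(f"{min_cycles([M_ONLY]):.2f}")                 # shifted adds alone, M only
print(f"{min_cycles([ANY]):.2f}")                    # if it could use any pipe
```

With these assumptions the bounds are 1.00, 1.33, 1.00 and 0.33 cycles. The measured 1.14 and 1.51 sit somewhat above the first two bounds, which is the direction expected when a fourth integer uop competes for the same three pipes, and the last two lines show why timing the shifted adds on its own distinguishes "M only" from "any I pipe".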