Can Cortex-A57 dual-issue 128-bit neon instructions?

The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).

However with our internal (synthetic) benchmarks, throughput seems to be limited to exactly 1 128-bit neon integer instruction, even when there is plenty of instruction parallelism available (the benchmark was written with the intention to test whether 128-bit neon instructions can be dual-issued, so this is something we took care). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (only neon integer arith, no loads/stores).

Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?

Thx, Clemens

Solution

In real code, not all instruction results will be written to the register file, instead they will pass through forwarding paths. If you mix dependent and independent instructions in your code, you may see higher IPC.

The A57 optimisation guide states that late-forwarding occurs for chains of multiply-accumulate instructions, so maybe something like this will dual-issue.

.loop
    vmla.s16 q0,q0,q1
    vmla.s16 q0,q0,q2
    vmla.s16 q0,q0,q3
    vmla.s16 q4,q4,q1
    vmla.s16 q4,q4,q2
    vmla.s16 q4,q4,q3
    ...etc