As far as I know, a loop with a serial data dependency such as A[i] += A[i-1] can't be vectorized.
But I'm not sure whether A[i] += A[i+1] is a RAW (read-after-write) data dependency or not, and whether this loop can be vectorized:
for(i = 0; i < n - 1; i++) {
A[i] += A[i + 1];
}
Your loop counter is increasing (i++), so you're looking ahead with A[i+1], not behind. That means you're just reading the same original-input elements twice, not re-reading any recent outputs. Thus there's no serial dependency.
Just try it with a compiler and look at the asm: you'll see vector loads/stores instead of scalar ones. (It's easy to tell the difference with integer instead of FP when compiling for x86.) E.g. on https://godbolt.org/ with gcc or clang -O3.
On a machine with efficient unaligned loads (like modern x86), your compiler will probably just do loads of A[i + 0..3] and A[i+1 + 0..3]. The other option is shuffling to create the offset vector, e.g. using x86 SSSE3 palignr, which was designed for exactly that and was really helpful on Core 2 (which did not have efficient SIMD unaligned loads).
GCC and clang both vectorize it for x86-64 with SSE2 (which is baseline for x86-64): https://godbolt.org/z/HdNsvC

GCC 9.1 for x86-64 (with the default -mtune=generic and only SSE2 available) chooses to do 2x load + add + store. Clang 8.0 chooses to unroll (as usual) and load from A[i+1 + 4*unroll + 0..3], using shufps to create the A[i + 0..3] vectors. The middle-end optimizer probably uses a recipe that was good with palignr, but then has to emulate it once it gets to code-gen with only SSE2, not SSSE3, available. Also, input pointers are likely to be 16-byte aligned, so loading vectors from 16*n + 4 bytes relative to that is unfortunate. Except that it will bottleneck on shuffle throughput on recent Intel CPUs anyway.
Clang with AVX1 but not AVX2 (e.g. -march=sandybridge) makes a hilarious mess: it uses 256-bit FP shuffles in multiple steps to emulate a 256-bit palignr, but then unpacks to 128-bit vectors for integer SIMD vpaddd (packed 32-bit add), then uses vinsertf128 to get back to 256-bit for 256-bit stores. SnB doesn't even have 256-bit load/store units, so those uops take 2 cycles to run and have much-larger-than-usual penalties for misaligned data.
A[i] += A[i-1] is harder to vectorize, but with efficient shuffles there are speedups to be had, especially with floating point, where the latency of the serial dependency hurts much more.
Or in general, if there's a closed-form formula for striding n steps ahead, you can run that in parallel across the elements of a SIMD vector, as in Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter?