Consider the following valarray-like class:
#include <stdlib.h>
struct va
{
void add1(const va& other);
void add2(const va& other);
size_t* data;
size_t size;
};
void va::add1(const va& other) {
for (size_t i = 0; i < size; ++i) {
data[i] += other.data[i];
}
}
void va::add2(const va& other){
for (size_t i = 0, s = size; i < s; ++i) {
data[i] += other.data[i];
}
}
The add2 function is vectorized by each of the compilers tested (MSVC, Clang, GCC, ICC), whereas add1 is not. See https://godbolt.org/z/c61qvrrbv
This is explained by potential aliasing: the compilers cannot prove that none of the elements pointed to by data is the size member itself.
However, there is also potential overlap between the elements pointed to by data and other.data. For MSVC there is additionally potential aliasing between those elements and the pointers themselves, since it does not take advantage of the strict aliasing rule. This applies to both add1 and add2.
The compilers emit checks for all the possible aliasing they suspect, and then execute the vectorized operation in add2.
Why don't they add a few more checks and get this optimization for add1 as well?
It looks like compilers aren't able to realize (and don't add code to check) that data[i] never points to this->size. That would make the loop trip-count not calculable ahead of the first iteration, which is something no mainstream compiler except ICC can handle.
Hopefully compilers can learn to check for possible overlap before applying their vectorization logic, like (.data > this) || (.data + .size) < this; hopefully there's an efficient way to do that. They already invent code to check for overlap between the two arrays in add2.
(The more checking code is required at startup, the more profitable vectorization has to be to pay for itself; 64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors, especially when the compiler doesn't know from PGO that size is usually not small and that the loop is hot. I tried gcc -march=icelake-client and -march=znver4 to not only enable AVX2 but also set tuning heuristics for CPUs with very good vector throughput and cache/memory bandwidth. But still no joy, so this possible aliasing is probably a complete roadblock, not a heuristic decision.)
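To make the idea concrete, here's a source-level sketch of such a check (a hypothetical free function, names mine, not part of the original class), testing whether &size falls inside [data, data + size) and comparing addresses as uintptr_t to sidestep pointer-ordering UB:
#include <cstdint>
// Hypothetical helper: only fall back to the conservative loop if &size
// really could be one of the elements.
void add1_with_check(va& self, const va& other) {
    auto lo = reinterpret_cast<std::uintptr_t>(self.data);
    auto hi = reinterpret_cast<std::uintptr_t>(self.data + self.size);
    auto sz = reinterpret_cast<std::uintptr_t>(&self.size);
    if (sz < lo || sz >= hi) {
        // size provably isn't an element of data[], so it can be hoisted.
        for (size_t i = 0, s = self.size; i < s; ++i)
            self.data[i] += other.data[i];
    } else {
        // Possible self-aliasing: conservative loop, re-reading size each iteration.
        for (size_t i = 0; i < self.size; ++i)
            self.data[i] += other.data[i];
    }
}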
Notice how GCC's loop branch is cmp rax, QWORD PTR [rdi+8], where rax holds i and [rdi+8] is this->size (x86-64 SysV calling convention, where rdi holds this), so it's reloading size every iteration. If we compile with -O3 -fno-tree-vectorize, we see that GCC's add2 loads the size into a register ahead of the loop and compares against that inside the loop, i.e. it hoists the load. The fact that GCC doesn't do that for add1 is a pretty clear sign that it thinks data[i] += ... can modify this->size.
# GCC's add1 inner loop with -O3 -march=icelake-client
.L3:
        mov     rcx, QWORD PTR [rsi+rax*8]   # load other.data[i]   (other.data hoisted into rsi)
        add     QWORD PTR [rdx+rax*8], rcx   # data[i] += ...       (this->data hoisted into rdx)
        inc     rax                          # ++i
        cmp     rax, QWORD PTR [rdi+8]       # compare i against this->size, reloaded from memory
        jb      .L3
Also, changing the member type to unsigned *data; or anything else that can't point to a size_t lets GCC, Clang, and ICC auto-vectorize add1. Using -fno-strict-aliasing disables vectorization again. (It also makes the compilers extra "paranoid", reloading this->data and other.data every iteration, like MSVC was always doing, and it breaks vectorization of add2 for those compilers.)
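For example, a variant like this (vb is just an illustrative name; same includes as above) gets its add1-style loop vectorized by GCC, Clang, and ICC, since under strict aliasing a store through an unsigned * can't modify the size_t member:
struct vb {
    unsigned* data;   // element type that can't alias the size_t member
    size_t size;
    void add1(const vb& other) {
        for (size_t i = 0; i < size; ++i)   // size can now be hoisted out of the loop
            data[i] += other.data[i];
    }
};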
Changing the pointer type doesn't help MSVC because it doesn't do type-based aliasing analysis; it always acts like gcc -fno-strict-aliasing. MSVC's add2 already checks for more than just overlap of the pointed-to arrays, I think; probably some of that extra cmp/jcc is checking that this->data[i] += ... won't change the .data pointer in this or other.
std::vector doesn't have size_t members, usually avoiding this: std::vector<size_t> wouldn't have this problem (at least in non-MSVC compilers) because type-based aliasing knows that a size_t * can't point at another pointer. std::vector normally stores three pointers, .begin, .end, and .end_of_capacity, so the size information is end - begin, not a member storing the size directly.
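In other words, roughly this layout (a sketch of the common approach; member names are made up, not any particular standard library's):
template <class T>
struct vector_layout {
    T* begin_;            // start of the elements
    T* end_;              // one past the last element; size() would be end_ - begin_
    T* end_of_capacity_;  // one past the end of the allocation
};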
For iterating over one array, it's normally at least as efficient to increment a pointer, like for (... ; ptr < endp ; ptr++) *ptr, so you're not using indexed addressing modes. That's presumably why std::vector is normally laid out that way, rather than as a pointer and two size_t members.
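A quick sketch of the two ways to write such a loop (illustrative functions; a compiler may of course rewrite one form into the other):
// Indexed form: tends to use indexed addressing modes like [reg + reg*8].
void inc_indexed(size_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i)
        p[i] += 1;
}
// Pointer-increment form: one live pointer, simple [reg] addressing, compared against endp.
void inc_pointer(size_t* p, size_t* endp) {
    for (; p != endp; ++p)
        *p += 1;
}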
Some RISC machines don't even have two-register indexed addressing modes. For iterating over two arrays, some CPUs will do better with fewer instructions, just incrementing one index instead of two pointers, but it depends on the microarchitecture; e.g. some x86 CPUs un-laminate add reg, [reg + reg] into 2 uops for the back-end, not keeping it micro-fused, especially with 3-operand AVX instructions.
An efficient way (in asm) to loop over two arrays on CPUs with indexed addressing is to address one array relative to the other. It's UB to do this in C++, and it would obfuscate your code, so it's something compilers should do for you: sub rsi, rdi outside the loop (subtract the pointers), then the loop body can be mov eax, [rsi + rdi] (second array = difference + first) / add [rdi], eax (first array) / add rdi, 4 (increment the pointer, which is also the index for the other array).
MSVC will actually do this optimization, which other compilers haven't picked up yet. (Godbolt)
// Compilers still optimize without __restrict, but it gets rid of the noise of extra checking
void foo(int *__restrict a, int *__restrict b){
for (int i=0 ; i<10240; i++){
a[i] += b[i];
}
}
void foo(int * __restrict,int * __restrict) PROC ; foo, COMDAT
lea rax, QWORD PTR [rcx+32]
sub rdx, rcx ;;;; <---- Pointer subtraction
mov ecx, 320 ; 00000140H
npad 4
$LL4@foo:
vmovdqu ymm1, YMMWORD PTR [rax-32] ;; unrolled with 4 vectors
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax-32]
vmovdqu YMMWORD PTR [rax-32], ymm1
vmovdqu ymm2, YMMWORD PTR [rax]
vpaddd ymm1, ymm2, YMMWORD PTR [rdx+rax]
vmovdqu YMMWORD PTR [rax], ymm1
vmovdqu ymm1, YMMWORD PTR [rax+32]
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax+32]
vmovdqu YMMWORD PTR [rax+32], ymm1
vmovdqu ymm1, YMMWORD PTR [rax+64]
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax+64]
vmovdqu YMMWORD PTR [rax+64], ymm1
lea rax, QWORD PTR [rax+128]
sub rcx, 1
jne SHORT $LL4@foo
vzeroupper
ret 0
Unfortunately, MSVC got it backwards and is using the two-register addressing mode as the memory source operand for vpaddd. That operand will un-laminate at issue/rename into the ROB on Intel Sandybridge-family, including at least Skylake and probably some later cores. But vpaddd ymm1, ymm1, [rdx] would be 1 uop. The pure load vmovdqu is always 1 uop even with an indexed addressing mode.
Indexed stores aren't ideal either (the store-address uop can't run on port 7 on Haswell / Skylake), and MSVC does avoid that. But it could get the best of both worlds by doing the pure load from b[] with an indexed addressing mode, and then doing the memory-source vpaddd + store with the simple addressing mode, like [rax+32].
So MSVC did save some code-size, and it gets some benefit in back-end throughput by needing only one pointer increment of loop overhead, and in AGU port contention: with L1d cache hits this can run at close to 1 vector per clock, since the simple store address lets it do 2 loads + 1 store every cycle (Intel's optimization manual suggests that Skylake can't fully sustain that for 32-byte loads/stores, for some unknown reason).
With an indexed addressing mode for the store, like GCC and clang use, it could only run at 1 vector per 1.5 cycles on Haswell / Skylake. (Ice Lake has two load AGUs and two separate store AGUs, avoiding this bottleneck, but I don't know whether un-lamination of indexed addressing modes is still a thing on Ice Lake or Alder Lake.)
I don't know what's up with MSVC preferring lea for incrementing a pointer, or preferring sub rcx, 1 / jne instead of computing an end-pointer with an lea before the loop (in place of the mov) and comparing against that. Then the end of the loop could be cmp rax, r8 / jne or something, which would macro-fuse into a single uop on AMD as well as Intel.