I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which the use of std::array performs better than the use of std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share some simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
The above code, which initializes the variables, is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
    return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
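For reference, my timings came from a minimal harness along these lines (a sketch, not my exact code; time_it is just an illustrative name):
#include <chrono>
#include <cstdio>

// Run an assemble function many times and report wall-clock seconds.
template <typename F>
double time_it(F f, int reps = 10'000'000)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::printf("vec: %.3f s\n", time_it(assemble_vec));
    std::printf("arr: %.3f s\n", time_it(assemble_arr));
}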
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why?
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
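To see what the optimizer is up against, here's a minimal sketch (a hypothetical function I'm making up, not code from the question) of what it effectively has to assume:
// All the optimizer sees: a store through a double* and loads through
// a const double* that nothing proves disjoint.
void add_twice(double* out, const double* in)
{
    out[0] += in[0];   // this store...
    out[1] += in[0];   // ...forces in[0] to be reloaded here, because
}                      // out and in could point into the same array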
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
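For concreteness, a std::vector<double> is roughly three pointers (a hedged sketch of a typical implementation, not any specific standard library):
// Approximately what the "control block" of a std::vector<double> holds.
struct vector_double_ish
{
    double* begin_;          // a const vector means these three pointers
    double* end_;            // won't change after construction, but the
    double* end_capacity_;   // heap doubles they point at are not in
};                           // read-only memory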
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension), because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
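For example, this perfectly legal pattern is what a restrict-qualified internal pointer would break (a sketch I'm making up to illustrate, not code from the question):
#include <vector>

double legal_outside_pointer()
{
    std::vector<double> w(3, 0.0);
    double* p = &w[0];   // taking the address of an element is legal
    w[1] = 1.0;          // accesses through the vector's internal pointer...
    p[1] += 2.0;         // ...and through the outside pointer may interleave;
    return w[1];         // __restrict inside the library would make this UB
}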
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]) and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants, because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
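In source terms, the array version effectively becomes something like this after constant folding (my sketch of the transformation, not actual compiler output):
void assemble_arr_as_compiled()
{
    // comb(a1[k],a2[k]) folds to compile-time constants:
    const double c0 = 2.0;  // a1[0]+a2[0] ==  1.0 + 1.0
    const double c1 = 1.0;  // a1[1]+a2[1] == -1.0 + 2.0
    const double c2 = 2.0;  // a1[2]+a2[2] ==  1.0 + 1.0
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}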
But couldn't the compiler hoist the loads from v1 and v2 out of the vector loop? No, again because of potential aliasing. It doesn't know that stores into glob[i+0..2] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
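So the vector version has to be compiled as if it were written like this (my sketch of what the asm is forced to do, using the question's globals):
void assemble_vec_as_compiled()
{
    const double* v1d = v1.data();  // control block pointers hoisted
    const double* v2d = v2.data();
    double* g = glob.data();
    for (size_t i=0; i<N-2; ++i)
    {
        g[i]   += v1d[0] + v2d[0];  // every load from v1d/v2d must be
        g[i+1] += v1d[1] + v2d[1];  // repeated each iteration: the stores
        g[i+2] += v1d[2] + v2d[2];  // above might have changed *v1d/*v2d
    }
}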
The compiler could have checked that glob.data() + 0 .. N-1 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap, because doing so would make the function run much faster.

But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, with two different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations) and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
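Done by hand, that check-and-dispatch transformation would look roughly like this, building on the question's globals (a sketch; overlaps() is a helper I'm inventing here):
#include <cstddef>
#include <cstdint>

// Hypothetical helper: do two byte ranges overlap?
static bool overlaps(const void* a, std::size_t asz,
                     const void* b, std::size_t bsz)
{
    const auto ai = reinterpret_cast<std::uintptr_t>(a);
    const auto bi = reinterpret_cast<std::uintptr_t>(b);
    return ai < bi + bsz && bi < ai + asz;
}

void assemble_vec_checked()
{
    double* g = glob.data();
    const bool disjoint =
        !overlaps(g, N * sizeof(double), v1.data(), 3 * sizeof(double)) &&
        !overlaps(g, N * sizeof(double), v2.data(), 3 * sizeof(double));
    if (disjoint)
    {   // fast path: hoisting is safe, like the array version
        const double c0 = v1[0] + v2[0];
        const double c1 = v1[1] + v2[1];
        const double c2 = v1[2] + v2[2];
        for (size_t i=0; i<N-2; ++i)
        {
            g[i]   += c0;
            g[i+1] += c1;
            g[i+2] += c2;
        }
    }
    else
    {   // slow path: the original reloading loop
        assemble_vec();
    }
}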
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2. (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations of the loop at once, with 4 scalar loads and stores to glob[i+0..3] and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That promises the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop, so it has somehow figured out that storing a double won't modify a pointer. However, MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt:
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;  // actually hurts ICC it seems?
    // #define g glob                  // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i]   += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop:
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed optimization for gcc/clang/ICC.
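The practical workaround, if you need this to be fast on all of them, is to do the hoisting yourself: plain double locals can't alias glob's storage, so every compiler is free to keep them in registers across the stores (a sketch using the question's definitions):
void assemble_vec_hoisted()
{
    // Evaluate comb() once; nothing can alias these locals.
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}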