I'm trying to implement the following operation using AVX:
for (i=0; i<N; i++) {
for(j=0; j<N; j++) {
for (k=0; k<K; k++) {
d[i][j] += 2 * a[i][k] * ( b[k][j]- c[k]);
}
}
}
for (int i=0; i<N; i++){
f+= d[ind[i]][ind[i]]/2;
}
Where d is a NxN matrix, a is a NxK, b a KxN and c a vector of length K. All of them are doubles. Of course, all the data is aligned and I am using #pragma vector aligned
to help compiler (gcc).
I know how to use AVX extensions with one-dimension arrays, but it is being a little bit tricky to me to do it with matrix. Currently, I have the following, but I'm not getting correct results:
for (int i=0; i< floor (N/4); i++){
for (int j=0; j< floor (N/4); j++){
__m256d D, A, B, C;
D = _mm256_setzero_pd();
#pragma vector aligned
for (int k=0; k<K_MAX; k++){
A = _mm256_load_pd(a[i] + k*4);
B = _mm256_load_pd(b[k] + j*4);
C = _mm256_load_pd(c + 4*k);
B = _mm256_sub_pd(B, C);
A = _mm256_mul_pd(A, B);
D = _mm256_add_pd(_mm256_set1_pd(2.0), A);
_mm256_store_pd(d[i] + j*4, D);
}
}
}
for (int i=0; i<N; i++){
f+= d[ind[i]][ind[i]]/2;
}
I hope someone can tell me where the mistake is.
Thanks in advance.
NOTE: I'm not willing to introduce OpenMP, just using SIMD Intel instructions
b[k][j]
is your problem: the elements b[k + 0..3][j]
aren't contiguous in memory. Using SIMD (in a reasonable / useful way) is not something you can drop in to the classic naive matmul loop. See What Every Programmer Should Know About Memory? - there's an appendix with an example of an SSE2 matmul (with cache-blocking) which shows how to do operations in a different order that's SIMD-friendly.
Soonts's answer shows how to vectorize at all, by vectorizing over j
, the middle loop. But that leaves a relatively poor memory access pattern, and 3 loads + 3 ALU operations inside the loop. (This answer started out as a comment on it, see it for the code I'm talking about and proposing changes to.)
Loop inversion should be possible to do j
as the inner-most loop. That would mean doing stores for d[i][j] += ...
inside the inner-most loop, but OTOH it makes more loop invariants in 2 * a[i][k] * ( b[k][j]- c[k] )
so you can usefully transform to d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k)
, i.e. one VFMSUBPD and one VADDPD per load&store. (With the bv
load folding into the FMSUB as a memory source operand, and the dv
load folding into VADDPD, so hopefully only 3 uops for the front-end, including a separate store, not including loop overhead.)
The compiler will have to unroll and avoid an indexed addressing mode so the store-address uop can stay micro-fused and run on port 7 on Intel CPUs (Haswell through Skylake-family), not competing with the two loads. Ice Lake doesn't have that problem, having two full independent store-AGUs separate from the two load AGUs. But probably still needs some loop unrolling to avoid a front-end bottleneck.
Here's an example, untested (original version contributed by Soonts, thanks). It optimizes down to 2 FP math ops in the loop in a different way: simply hoisting 2*a
out of the loop, doing SUB then FMA for dv += (2av)*(sub_result)
. But bv
can't be a source operand for vsubpd
because we need bv - cv
. But we can fix that by negating cv
to allow (-cv) + bv
in the inner loop, with bv
as a memory source operand. Sometimes compilers will do things like that for you, but here it seems they didn't, so I did it manually. Otherwise we get a separate vmovupd
load going through the front-end.
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
// This double [N][N] C99 VLA syntax isn't portable to C++ even with GNU extensions
// restrict tells the compiler the output doesn't overlap with any of the inputs
void matop(size_t N, size_t K, double d[restrict N][N], const double a[restrict N][K], const double b[restrict K][N], const double c[restrict K])
{
for( size_t i = 0; i < N; i++ ) {
// loop-invariant pointers for this outer iteration
//double* restrict rowDi = &d[ i ][ 0 ];
const double* restrict rowAi = &a[ i ][ 0 ];
for( size_t k = 0; k < K; k++ ) {
const double* restrict rowBk = &b[ k ][ 0 ];
double* restrict rowDi = &d[ i ][ 0 ];
#if 0 // pure scalar
// auto-vectorizes ok; still a lot of extra checking outside outermost loop even with restrict
for (size_t j=0 ; j<N ; j++){
rowDi[j] += 2*rowAi[k] * (rowBk[j] - c[k]);
}
#else // SIMD inner loop with cleanup
// *** TODO: unroll over 2 or 3 i values
// and maybe also 2 or 3 k values, to reuse each bv a few times while it's loaded.
__m256d av = _mm256_broadcast_sd( rowAi + k );
av = _mm256_add_pd( av, av ); // 2*a[ i ][ k ] broadcasted
const __m256d cv = _mm256_broadcast_sd( &c[ k ] );
const __m256d minus_ck = _mm256_xor_pd(cv, _mm256_set1_pd(-0.0)); // broadcasted -c[k]
//const size_t N_aligned = ( (size_t)N / 4 ) * 4;
size_t N_aligned = N & -4; // round down to a multiple of 4 j iterations
const double* endBk = rowBk + N_aligned;
//for( ; j < N_aligned; j += 4 )
for ( ; rowBk != endBk ; rowBk += 4, rowDi += 4) { // coax GCC into using pointer-increments in the asm, instead of j+=4
// Load the output vector to update
__m256d dv = _mm256_loadu_pd( rowDi );
// Update with FMA
__m256d bv = _mm256_loadu_pd( rowBk );
__m256d t2 = _mm256_add_pd( minus_ck, bv ); // bv - cv
dv = _mm256_fmadd_pd( av, t2, dv );
// Store back to the same address
_mm256_storeu_pd( rowDi, dv );
}
// rowDi and rowBk point to the double after the last full vector
// The remainder, if you can't pad your rows to a multiple of 4 and step on that padding
for(int j=0 ; j < (N&3); j++ )
rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] + _mm256_cvtsd_f64( minus_ck ) );
#endif
}
}
}
Without unrolling (https://godbolt.org/z/6WeYKbnYY), GCC11's inner loop asm looks like this, all single-uop instructions that can stay micro-fused even in the back-end on Haswell and later.
.L7: # do{
vaddpd ymm0, ymm2, YMMWORD PTR [rax] # -c[k] + rowBk[0..3]
add rax, 32 # rowBk += 4
add rdx, 32 # rowDi += 4
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx-32] # fma(2aik, Bkj-ck, Dij)
vmovupd YMMWORD PTR [rdx-32], ymm0 # store FMA result
cmp rcx, rax
jne .L7 # }while(p != endp)
But it's 6 total uops, 3 of them loop overhead (pointer increments and fused cmp+jne), so Haswell through Skylake could only run it at 1 iteration per 1.5 clocks, bottlenecked on the 4-wide issue stage in the front-end. (Which wouldn't let OoO exec get ahead on executing the pointer increments and loop branch, to notice early and recover while the back-end was still chewing on older loads and FP math.)
So loop unrolling should be helpful, since we managed to coax GCC into using indexed addressing modes. Without that it's relatively useless with AVX code on Intel Haswell/Skylake CPUs, with each vaddpd ymm5, ymm4, [rax + r14]
decoding as 1 micro-fused uop, but unlaminating into 2 at issue into the back-end, not helping us get more work through the narrowest part of the front-end. (A lot like if we'd used a separate vmovupd
load like we got with _mm256_sub_pd(bv, cv)
instead of add(bv, -cv)
.)
The vmovupd ymmword ptr [rbp + r14], ymm5
store stays micro-fused but can't run on port 7, limiting us to a total of 2 memory operations per clock (up to 1 of which can be a store.) So a best case of 1.5 cycles per vector.
Compiled on https://godbolt.org/z/rd3rn9zor with GCC and clang -O3 -march=skylake -funroll-loops
. GCC does actually use pointer increments with loads folded into 8x vaddpd
and 8x vfmadd213pd
. But clang uses indexed addressing modes and doesn't unroll. (You probably don't want -funroll-loops
for your whole program, so either compile this separately or manually unroll. GCC's unrolling fully peels a prologue that does 0..7 vector iterations before entering the actual SIMD loop, so it's quite aggressive.)
GCC's loop-unrolling looks useful here for large N, amortizing the pointer increments and loop overhead over multiple vectors. (GCC doesn't know how to invent multiple accumulators for FP dep chains in a dot product for example, making its unrolling useless in that case, unlike clang.)
Unfortunately clang doesn't unroll the inner loop for us, but it does use vmaskmovpd
in an interesting way for the cleanup.
It's maybe good that we use a separate loop counter for cleanup, in a way that lets the compiler easily prove the trip-count for the cleanup is 0..3, so it doesn't try to auto-vectorize with YMM.
The other way to do it, using an actual j
variable for the inner loop and its cleanup, more like Soonts' edit. IIRC, compilers did try to auto-vectorize the cleanup for this, wasting code size and some always-false branching.
size_t j = 0; // used for cleanup loop after
for( ; j < N_aligned; j += 4 )
{
// Load the output vector to update
__m256d dv = _mm256_loadu_pd( rowDi + j );
// Update with FMA
__m256d bv = _mm256_loadu_pd( rowBk + j );
__m256d t2 = _mm256_sub_pd( bv, cv ); // bv - cv
dv = _mm256_fmadd_pd( av, t2, dv );
// Store back to the same address
_mm256_storeu_pd( rowDi + j, dv );
}
// The remainder, if you can't pad your rows to a multiple of 4
for( ; j < N; j++ )
rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] - _mm256_cvtsd_f64( cv ) );
This has a fairly good mix of load&store vs. FP math for modern CPUs (https://agner.org/optimize/ and https://uops.info/), especially Intel where we can do 2 loads and 1 store. I think Zen 2 or 3 can also do 2 loads + 1 store. It needs to hit in L1d cache to sustain that kind of throughput, though. (And even then, Intel's optimization manual says the max sustained L1d bandwidth on Skylake is less than the full 96 bytes/cycle that would require. More like mid-80s IIRC, so we can't quite expect one result vector per cycle, even with sufficient unrolling to avoid front-end bottlenecks.)
There's no latency bottleneck, since we move on to a new dv
every iteration instead of accumulating anything across loop iterations.
The other advantage to this is that memory access to d[i][j]
and b[k][j]
would be sequential, with no other memory access in the inner-most loop. (The middle loop would do broadcast-loads of a[i][k]
and c[k]
. Those seem likely to cache-miss if the inner loop evicts too much; with some unrolling of the outer loop, one SIMD load and some shuffling could help, but probably cache-blocking would avoid a need for that.)
Looping over the same d[i]
row repeatedly for different b[k]
rows gives us locality for the part that we're modifying (i.e. use k
as the middle loop, keeping i
as the outer-most.) With k
as the outer loop, we'd be looping K
times over the whole d[0..N-1][0..N-1]
, probably needing to write + read each pass all that way out to whichever level of cache or memory could hold it.
But really you'd still want to cache-block if each row is really long, so you avoid the cache misses to bring all of b[][]
in from DRAM N times. And avoid evicting the stuff you're going to broadcast-load next.
Some of the above problems with maxing out load/store execution unit throughput, and requiring the compiler to use non-indexed addressing modes, can go away if we do more with each vector of data while it's loaded.
For example, instead of working on just one row of d[][]
, we could be working on 2, 3, or 4. Then every (rowBk[j] - c[k])
result can be used that many times (with a different 2aik
) for a d[i+unroll][j + 0..vec]
vector.
And we can also load a couple different (rowBk+K*0..unroll)[j+0..3]
, each with a corresponding minus_ck0
, minus_ck1
, etc. (Or keep an array of vectors; as long as it's small and the compiler has enough registers, the elements won't exist in memory.)
With multiple bv-cv
and dv
vectors in registers all at the same time, we can do significantly more FMAs per load without increasing the total amount of FP work. It takes more registers for constants, though, otherwise we could be defeating the purpose by forcing more reloads.
The d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k)
transformation wouldn't be useful here; we want to keep bv-cv
separate from i
so we can reuse that result as an input for different FMAs.
The b[k][j]+(-c[k])
can still benefit from micro-fusion of a load with a vaddpd
so ideally it would still use a pointer increment, but the front-end might not be a bottleneck anymore.
Don't overdo it with this; too many memory input streams can be a problem for cache conflict misses especially for some N values that might create aliasing, and also for HW prefetching tracking them all. (Although Intel's L2 streamer is said to track 1 forward and 1 backward stream per 4k page, IIRC.) Probably about 4 to 8 ish streams is ok. But if d[][]
isn't missing in L1d, then it's not really an input stream from memory. You don't want your b[][]
input rows to be evicting the d
data, though, since you'll be looping over 2 to 4 rows of d
data repeatedly.
Soonts's current loop with 3 loads and 3 ALU operations isn't ideal, although 1 load per FMA operation is already ok if they hit in cache (most modern CPUs can do 2 each per clock, although AMD Zen can also do 2 FP adds in parallel with mul/fma). If that extra ALU operation was a bottleneck, we could pre-multiply a[][]
by 2 once, taking only O(N*K)
work vs. O(N^2*K)
to do it on the fly. But it's probably not a bottleneck and thus not worth it.
More importantly, the memory access pattern in Soonts's current answer is looping forward 1 double at a time for broadcast loads of c[k]
and a[i][k]
which is good, but the bv = _mm256_loadu_pd
of b[k][j + 0..3]
is unfortunately striding down a column.
If you're going to unroll as Soonts suggested, don't just do two dep chains for one dv
, do at least two vectors, d[i][j + 0..3]
and 4..7
so you use a whole 64 bytes (full cache line) from every b[k][j]
you touch. Or four vectors for a pair of cache-lines. (Intel CPUs at least use an adjacent-line prefetcher, which likes to complete a 128-byte aligned pair of cache lines, so you'd benefit from aligning the rows of b[][]
by 128. Or at least by 64, and get some benefit from adjacent-line prefetching.
If a vertical slice of b[][]
fits in some level of cache (along with the row of d[i][]
you're currently accumulating into), the next stride down the next group of columns can benefit from that prefetching and locality. If not, fully using the lines you touch is more important, so they don't have to get pulled in again later.
So with Soonts's vectorization strategy, for large problems where this won't fit in L1d cache, probably good to make sure b
's rows are aligned by 64, even if that means padding at the end of each row. (The storage geometry doesn't have to match the actual matrix dimension; you pass N
and row_stride
separately. You use one for index calculations, the other for loop bounds.)