Having code of this nature:

void foo(double *restrict A, double *restrict x,
         double *restrict y) {
    y[5] += A[4] * x[5];
    y[5] += A[5] * x[1452];
    y[5] += A[6] * x[3373];
}
The result of compiling with gcc 10.2 and the flags -O3 -mfma -mavx2 -fvect-cost-model=unlimited (Compiler Explorer) is:
foo(double*, double*, double*):
vmovsd xmm1, QWORD PTR [rdx+40]
vmovsd xmm0, QWORD PTR [rdi+32]
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi+40]
vmovsd xmm2, QWORD PTR [rdi+40]
vfmadd231sd xmm0, xmm2, QWORD PTR [rsi+11616]
vmovsd xmm3, QWORD PTR [rdi+48]
vfmadd231sd xmm0, xmm3, QWORD PTR [rsi+26984]
vmovsd QWORD PTR [rdx+40], xmm0
ret
It does not pack any data together (4 vmovsd for loading data and 1 for storing), and it performs 3 vfmaddXXXsd operations. Nonetheless, my motivation for vectorizing this is that it could be done with a single vfmadd231pd. My "cleanest" attempt to write this code using AVX2 intrinsics is:
#include <immintrin.h>

void foo_intrin(double *restrict A, double *restrict x,
                double *restrict y) {
    __m256d __vop0, __vop1, __vop2;
    __m128d __lo256, __hi256;

    // THE ISSUE: packing the non-contiguous operands
    __vop0 = _mm256_maskload_pd(&A[4], _mm256_set_epi64x(0, -1, -1, -1));
    __vop1 = _mm256_mask_i64gather_pd(_mm256_setzero_pd(), &x[5],
                                      _mm256_set_epi64x(0, 3368, 1447, 0),
                                      _mm256_set_pd(0, -1, -1, -1), 8);

    // 1 vs 3 FMADD, "the gain" (accumulate onto a zeroed register)
    __vop2 = _mm256_fmadd_pd(__vop0, __vop1, _mm256_setzero_pd());

    // reducing 4 double elements:
    // Peter Cordes' answer https://stackoverflow.com/a/49943540/2856041
    __lo256 = _mm256_castpd256_pd128(__vop2);
    __hi256 = _mm256_extractf128_pd(__vop2, 0x1);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    // question: could a shuffle be used here instead?
    // __hi256 = _mm_shuffle_pd(__lo256, __lo256, 0x1);
    __hi256 = _mm_unpackhi_pd(__lo256, __lo256);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    y[5] += __lo256[0];  // GNU vector-extension indexing; _mm_cvtsd_f64 is the portable spelling
}
This generates the following ASM:
foo_intrin(double*, double*, double*):
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vmovapd ymm3, YMMWORD PTR .LC2[rip]
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vmaskmovpd ymm1, ymm0, YMMWORD PTR [rdi+32]
vxorpd xmm0, xmm0, xmm0
vgatherqpd ymm0, QWORD PTR [rsi+40+ymm2*8], ymm3
vxorpd xmm2, xmm2, xmm2
vfmadd132pd ymm0, ymm2, ymm1
vmovapd xmm1, xmm0
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm0, xmm1
vunpckhpd xmm1, xmm0, xmm0
vaddpd xmm0, xmm0, xmm1
vaddsd xmm0, xmm0, QWORD PTR [rdx+40]
vmovsd QWORD PTR [rdx+40], xmm0
vzeroupper
ret
.LC0:
.quad -1
.quad -1
.quad -1
.quad 0
.LC1:
.quad 0
.quad 1447
.quad 3368
.quad 0
.LC2:
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long 0
Sorry if anyone is having an anxiety attack right now; I am deeply sorry. Let's break this down:

- The vxorpd instructions are for cleaning registers, but icc generates just one, not two.
- Regarding maskload: I have read advice against masked loads in AVX2, since "masked instructions are very slow in instruction sets prior to AVX512". However, uops.info reports, for Skylake ("regular", no AVX-512), that _mm256_load_pd has latencies [≤5;≤8] and a throughput of 0.5, and that _mm256_maskload_pd has latencies [1;≤9] and a throughput of 0.5 as well, but decodes into two uops instead of one. Is this difference really so huge? Would it be better to pack the data in a different manner?
- Regarding the mask_gather-fashion instructions: as far as I understand from all the documentation above, they deliver the same performance whether a mask is used or not. Is this correct? Both uops.info and the Intel Intrinsics Guide report the same performance and ASM form; it is probable that I am missing something.
- In which cases is it worth using a gather rather than a "simple" set, speaking in intrinsics terms? I am aware that set generates vmov-type instructions depending on the data (e.g. if the data are constants it may just load from an address, as with .LC0, .LC1 and .LC2).
- _mm256_shuffle_pd and _mm256_unpackhi_pd have the same latency and throughput; the first generates a vpermilpd and the second a vunpckhpd, and uops.info also reports the same values for both. Is there any difference?
- Last but not least, is this ad-hoc vectorization worth it? I do not mean just my intrinsics code, but the concept of vectorizing code like this in general. I suspect there are too many data movements compared with the clean code compilers usually produce, so my concern is how to improve the packing of this non-contiguous data.
vfmaddXXXsd and vfmaddXXXpd instructions are "cheap" (single uop, 2/clock throughput), even cheaper than shuffles (1/clock throughput on Intel CPUs) or gather-loads; see https://uops.info/. Load operations are also 2/clock, so lots of scalar loads (especially from the same cache line) are quite cheap, and notice how 3 of them can fold into memory source operands for the FMAs.
Worst case, packing 4 (x2) totally non-contiguous inputs and then manually scattering the outputs is definitely not worth it vs. just using scalar loads and scalar FMAs (especially when that allows memory source operands for the FMAs).
Your case is far from the worst case; you have 3 contiguous elements from 1 input. If you know you can safely load 4 elements without risk of touching an unmapped page, that takes care of that input. (And you can always use maskload). But the other vector is still non-contiguous and may be a showstopper for speedups.
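For the contiguous A side, a minimal sketch of that idea (the helper name is mine), assuming it is known that reading A[7] cannot fault:

#include <immintrin.h>

// Sketch only: load A[4..7] with one plain unaligned load instead of a
// maskload, then zero the unused top lane so it cannot inject NaN/Inf
// into a later horizontal sum.
static inline __m256d load_A_slice(const double *A) {
    __m256d a = _mm256_loadu_pd(&A[4]);                   // A[4..7]
    return _mm256_blend_pd(a, _mm256_setzero_pd(), 0x8);  // zero lane 3
}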
It's usually not worth it if it would take more total instructions (actually uops) to do it via shuffling than plain scalar. And/or if shuffle throughput would be a worse bottleneck than anything in the scalar version.
(vgatherqpd counts as many instructions for this purpose, being multi-uop and doing 1 cache access per load. Also you'd have to load constant vectors of indices instead of hard-coding offsets into addressing modes.

Also, gathers are quite slow on AMD CPUs, even Zen 2. We don't have scatters at all until AVX512, and those are slow even on Ice Lake. Your case doesn't need scatters, though, just a horizontal sum. Which will involve more shuffles and vaddpd / vaddsd. So even with a maskload + gather for the inputs, having the 3 products in separate vector elements is not particularly convenient for you.)
A little bit of SIMD (not a whole array, just a few operations) can be helpful, but this doesn't look like one of the cases where it's a significant win. Maybe there's something worth doing, like replacing 2 loads with a load + a shuffle. Or maybe shortening the latency chain through y[5] by summing the 3 products first and only then adding to the output, instead of the chain of 3 FMAs. That might even be numerically better in cases where the accumulator holds a large number, since adding multiple small numbers to a big total loses precision. Of course it would cost 1 mul, 2 FMAs, and 1 add.
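A minimal sketch of that reassociated version (the function name is mine; fma() from <math.h> makes the contractions explicit and should compile to vfmadd instructions with -mfma):

#include <math.h>

// Sum the three products first (1 mul + 2 FMAs), then do a single add into
// y[5]: the dependency chain through y[5] shrinks from 3 FMAs to 1 add.
void foo_sum_first(double *restrict A, double *restrict x,
                   double *restrict y) {
    double sum = A[4] * x[5];
    sum = fma(A[5], x[1452], sum);
    sum = fma(A[6], x[3373], sum);
    y[5] += sum;
}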