Having code of this nature:

void foo(double *restrict A, double *restrict x,
         double *restrict y) {
    y[5] += A[4] * x[5];
    y[5] += A[5] * x[1452];
    y[5] += A[6] * x[3373];
}
The result of compiling with gcc 10.2 and the flags -O3 -mfma -mavx2 -fvect-cost-model=unlimited (Compiler Explorer) is:
foo(double*, double*, double*):
vmovsd xmm1, QWORD PTR [rdx+40]
vmovsd xmm0, QWORD PTR [rdi+32]
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi+40]
vmovsd xmm2, QWORD PTR [rdi+40]
vfmadd231sd xmm0, xmm2, QWORD PTR [rsi+11616]
vmovsd xmm3, QWORD PTR [rdi+48]
vfmadd231sd xmm0, xmm3, QWORD PTR [rsi+26984]
vmovsd QWORD PTR [rdx+40], xmm0
ret
It does not pack any data together (4 vmovsd for loading data and 1 for storing), and it performs 3 vfmaddXXXsd operations. Nonetheless, my motivation for vectorizing this is that it could be done with a single vfmadd231pd. My "cleanest" attempt to write this code using AVX2 intrinsics is:
#include <immintrin.h>

void foo_intrin(double *restrict A, double *restrict x,
                double *restrict y) {
    __m256d __vop0, __vop1, __vop2;
    __m128d __lo256, __hi256;

    // THE ISSUE: packing the non-contiguous operands
    __vop0 = _mm256_maskload_pd(&A[4], _mm256_set_epi64x(0, -1, -1, -1));
    __vop1 = _mm256_mask_i64gather_pd(_mm256_setzero_pd(), &x[5],
                                      _mm256_set_epi64x(0, 3368, 1447, 0),
                                      _mm256_set_pd(0, -1, -1, -1), 8);

    // 1 vs 3 FMADD, "the gain" (accumulate onto a zeroed register)
    __vop2 = _mm256_fmadd_pd(__vop0, __vop1, _mm256_setzero_pd());

    // reducing 4 double elements:
    // Peter Cordes' answer https://stackoverflow.com/a/49943540/2856041
    __lo256 = _mm256_castpd256_pd128(__vop2);
    __hi256 = _mm256_extractf128_pd(__vop2, 0x1);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    // question: could a shuffle be used here instead?
    // __hi256 = _mm_shuffle_pd(__lo256, __lo256, 0x1);
    __hi256 = _mm_unpackhi_pd(__lo256, __lo256);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    y[5] += __lo256[0];  // GNU vector-extension indexing; _mm_cvtsd_f64 is the portable spelling
}
This generates the following ASM:
foo_intrin(double*, double*, double*):
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vmovapd ymm3, YMMWORD PTR .LC2[rip]
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vmaskmovpd ymm1, ymm0, YMMWORD PTR [rdi+32]
vxorpd xmm0, xmm0, xmm0
vgatherqpd ymm0, QWORD PTR [rsi+40+ymm2*8], ymm3
vxorpd xmm2, xmm2, xmm2
vfmadd132pd ymm0, ymm2, ymm1
vmovapd xmm1, xmm0
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm0, xmm1
vunpckhpd xmm1, xmm0, xmm0
vaddpd xmm0, xmm0, xmm1
vaddsd xmm0, xmm0, QWORD PTR [rdx+40]
vmovsd QWORD PTR [rdx+40], xmm0
vzeroupper
ret
.LC0:
.quad -1
.quad -1
.quad -1
.quad 0
.LC1:
.quad 0
.quad 1447
.quad 3368
.quad 0
.LC2:
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long 0
Sorry if anyone is having an anxiety attack right now; I am deeply sorry. Let's break this down:

- The vxorpd instructions are for cleaning registers, but icc generates just one, not two.
- Regarding maskload: I have read advice against masked loads in AVX2, since "masked instructions are very slow in instruction sets prior to AVX512". However, uops.info reports, for Skylake ("regular", no AVX-512), that _mm256_load_pd has latencies [≤5;≤8] and a throughput of 0.5, and that _mm256_maskload_pd has latencies [1;≤9] and a throughput of 0.5 as well, but decodes into two uops instead of one. Is this difference really so huge? Would it be better to pack the data in a different manner?
- Regarding the mask_gather-fashion instructions: as far as I understand from all the documentation above, they deliver the same performance whether a mask is used or not. Is this correct? Both uops.info and the Intel Intrinsics Guide report the same performance and ASM form; it is probable that I am missing something.
- In which cases is it worth using a gather rather than a "simple" set, speaking in intrinsics terms? I am aware that set generates vmov-type instructions depending on the data (e.g. if the data are constants it may just load from an address, as with .LC0, .LC1 and .LC2).
- _mm256_shuffle_pd and _mm256_unpackhi_pd have the same latency and throughput; the first generates a vpermilpd and the second a vunpckhpd, and uops.info also reports the same values for both. Is there any difference?
- Last but not least, is this ad-hoc vectorization worth it? I do not mean just my intrinsics code, but the concept of vectorizing code like this in general. I suspect there are too many data movements compared with the clean code compilers usually produce, so my concern is how to improve the packing of this non-contiguous data.
vfmaddXXXsd and vfmaddXXXpd instructions are "cheap" (single uop, 2/clock throughput), even cheaper than shuffles (1/clock throughput on Intel CPUs) or gather-loads; see https://uops.info/. Load operations are also 2/clock, so lots of scalar loads (especially from the same cache line) are quite cheap, and notice how 3 of them can fold into memory source operands for the FMAs.
Worst case, packing 4 (x2) totally non-contiguous inputs and then manually scattering the outputs is definitely not worth it vs. just using scalar loads and scalar FMAs (especially when that allows memory source operands for the FMAs).
Your case is far from the worst case; you have 3 contiguous elements from 1 input. If you know you can safely load 4 elements without risk of touching an unmapped page, that takes care of that input. (And you can always use maskload). But the other vector is still non-contiguous and may be a showstopper for speedups.
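For the contiguous A side, a minimal sketch of that idea (the helper name is mine), assuming it is known that reading A[7] cannot fault:

#include <immintrin.h>

// Sketch only: load A[4..7] with one plain unaligned load instead of a
// maskload, then zero the unused top lane so it cannot inject NaN/Inf
// into a later horizontal sum.
static inline __m256d load_A_slice(const double *A) {
    __m256d a = _mm256_loadu_pd(&A[4]);                   // A[4..7]
    return _mm256_blend_pd(a, _mm256_setzero_pd(), 0x8);  // zero lane 3
}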
It's usually not worth it if it would take more total instructions (actually uops) to do it via shuffling than plain scalar. And/or if shuffle throughput would be a worse bottleneck than anything in the scalar version.
(vgatherqpd counts as many instructions for this purpose, being multi-uop and doing 1 cache access per load. Also you'd have to load constant vectors of indices instead of hard-coding offsets into addressing modes.

Also, gathers are quite slow on AMD CPUs, even Zen 2. We don't have scatters at all until AVX512, and those are slow even on Ice Lake. Your case doesn't need scatters, though, just a horizontal sum. Which will involve more shuffles and vaddpd / vaddsd. So even with a maskload + gather for the inputs, having the 3 products in separate vector elements is not particularly convenient for you.)
A little bit of SIMD (not a whole array, just a few operations) can be helpful, but this doesn't look like one of the cases where it's a significant win. Maybe there's something worth doing, like replacing 2 loads with a load + a shuffle. Or maybe shortening the latency chain through y[5] by summing the 3 products first and only then adding to the output, instead of the chain of 3 FMAs. That might even be numerically better in cases where the accumulator holds a large number, since adding multiple small numbers to a big total loses precision. Of course it would cost 1 mul, 2 FMAs, and 1 add.
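A minimal sketch of that reassociated version (the function name is mine; fma() from <math.h> makes the contractions explicit and should compile to vfmadd instructions with -mfma):

#include <math.h>

// Sum the three products first (1 mul + 2 FMAs), then do a single add into
// y[5]: the dependency chain through y[5] shrinks from 3 FMAs to 1 add.
void foo_sum_first(double *restrict A, double *restrict x,
                   double *restrict y) {
    double sum = A[4] * x[5];
    sum = fma(A[5], x[1452], sum);
    sum = fma(A[6], x[3373], sum);
    y[5] += sum;
}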