Tags: c++, simd, intrinsics, avx2

Summing vec4[idx[i]] * scalar[i] with YMM vector registers


I am trying to optimize the following sum: sum{vec4[indexarray[i]] * scalar[i]}, where vec4 is an array of float[4] and scalar is an array of float. With 128-bit registers, this boils down to

sum = _mm_fmadd_ps(
            _mm_loadu_ps(vec4[indexarray[i]]),
            _mm_set_ps1(scalar[i]),
            sum);
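
For reference, the full 128-bit loop looks something like this (a minimal sketch; the function signature and array types are my assumptions):

#include <immintrin.h>

// Computes out = sum{vec4[indexarray[i]] * scalar[i]} over n entries.
// Requires FMA3 (e.g. compile with -mfma or -march=haswell).
void weighted_sum_128(const float (*vec4)[4], const int *indexarray,
                      const float *scalar, int n, float out[4])
{
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; ++i) {
        sum = _mm_fmadd_ps(
                _mm_loadu_ps(vec4[indexarray[i]]),  // load one float[4]
                _mm_set_ps1(scalar[i]),             // broadcast the weight
                sum);
    }
    _mm_storeu_ps(out, sum);
}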

If I want to execute the FMA on 256-bit registers, I would have to do something like

__m256 coef = _mm256_set_m128(
                    _mm_set_ps1(scalar[2 * i + 0]),
                    _mm_set_ps1(scalar[2 * i + 1]));
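// note: _mm256_set_m128 takes (hi, lo), so element 2*i+0 lands in the high lane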
__m256 vec = _mm256_set_m128(
                    _mm_loadu_ps(vec4[indexarray[2 * i + 0]]),
                    _mm_loadu_ps(vec4[indexarray[2 * i + 1]]));
sum = _mm256_fmadd_ps(vec, coef, sum);

along with a shuffle and add at the end to sum the upper and lower lanes.
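
That final reduction might look like this (a sketch; sum is the __m256 accumulator from above):

__m128 lo  = _mm256_castps256_ps128(sum);    // low lane, no instruction needed
__m128 hi  = _mm256_extractf128_ps(sum, 1);  // high lane, one shuffle uop
__m128 res = _mm_add_ps(lo, hi);             // final float[4] sums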

Theoretically, I gain 5 cycles of latency (assuming Haswell architecture) from using a single FMA, but lose 2×3 cycles of latency from the _mm256_set_m128.

Is there a way to make this any faster using the ymm registers, or will all the gains from the single FMA be offset, with interest, by the cost of combining the xmm registers?


Solution

  • but lose 2x3 in latency from the _mm256_set_m128

    No, that latency isn't on the critical path; it's part of preparing an input for FMA. The concern with doing more shuffles for each independent i value is throughput.

    Latency only really matters for the loop-carried dependency chain through sum, which is both an input and an output of the FMA.

    The inputs that only depend on i and memory contents can be handled in parallel across multiple iterations by out-of-order execution.


    You might still come out ahead building 256-bit vectors, though. However you write the source (_mm256_set_m128 isn't a real instruction), it will probably bottleneck on the front end or on 1-per-clock shuffle throughput, so you want it to compile to a 128-bit load followed by vinsertf128 ymm, ymm, [mem], 1 to insert the high half of the vector. vinsertf128 does cost a shuffle uop.
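
    As a sketch, writing the source with an explicit cast + insert gives compilers a good shot at emitting exactly that load + vinsertf128 pair. Note the lane order is swapped relative to the _mm256_set_m128 version above, which is harmless for a sum as long as coef is built the same way:

    __m256 vec = _mm256_castps128_ps256(
                     _mm_loadu_ps(vec4[indexarray[2 * i + 0]]));   // vmovups xmm
    vec = _mm256_insertf128_ps(
              vec, _mm_loadu_ps(vec4[indexarray[2 * i + 1]]), 1);  // vinsertf128 ymm, ymm, [mem], 1
    // coef built the same way so the lanes match up
    __m256 coef = _mm256_castps128_ps256(_mm_set_ps1(scalar[2 * i + 0]));
    coef = _mm256_insertf128_ps(coef, _mm_set_ps1(scalar[2 * i + 1]), 1);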

    If you bottleneck on latency with 128-bit registers, you might be better off just using multiple accumulators, so that (on Haswell) up to 10 FMAs can be in flight at once: 5-cycle latency / 0.5-cycle throughput = 10. Add the accumulators together at the end. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more on hiding FMA latency with multiple accumulators.
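
    A minimal sketch of that idea with 128-bit registers (the unroll factor of 4 is illustrative; up to 10 accumulators would fully hide Haswell's FMA latency):

    __m128 sum0 = _mm_setzero_ps(), sum1 = _mm_setzero_ps();
    __m128 sum2 = _mm_setzero_ps(), sum3 = _mm_setzero_ps();
    int i = 0;
    for (; i + 3 < n; i += 4) {   // 4 independent dependency chains
        sum0 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 0]]), _mm_set_ps1(scalar[i + 0]), sum0);
        sum1 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 1]]), _mm_set_ps1(scalar[i + 1]), sum1);
        sum2 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 2]]), _mm_set_ps1(scalar[i + 2]), sum2);
        sum3 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i + 3]]), _mm_set_ps1(scalar[i + 3]), sum3);
    }
    for (; i < n; ++i)            // cleanup for leftover elements
        sum0 = _mm_fmadd_ps(_mm_loadu_ps(vec4[indexarray[i]]), _mm_set_ps1(scalar[i]), sum0);
    __m128 sum = _mm_add_ps(_mm_add_ps(sum0, sum1), _mm_add_ps(sum2, sum3));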