I just noticed the absence of _mm256_insert_pd() / _mm256_insert_ps() / _mm_insert_pd(); _mm_insert_ps() does exist, but with a somewhat odd usage pattern. Meanwhile _mm_insert_epi32(), _mm256_insert_epi32() and the other integer variants all exist.
Is it an intentional decision by Intel not to provide the float/double variants for some reason? And what is the correct and most performant way to set a single float/double at a given position (not only the 0th) of an SSE/AVX register?
I implemented the following AVX double variant of insert, which works, but maybe there is a better way to do this:
#include <immintrin.h>
#include <cstdint>
#include <cstring>

template <int I>
__m256d _mm256_insert_pd(__m256d a, double x) {
    int64_t ix;
    std::memcpy(&ix, &x, sizeof(x));   // type-pun the double to a 64-bit integer
    return _mm256_castsi256_pd(        // reinterpret back to double after the integer insert
        _mm256_insert_epi64(_mm256_castpd_si256(a), ix, I)
    );
}
As far as I can see, the extract float/double variants are also absent in SSE/AVX for some reason; I know only _mm_extract_ps() exists, but not the others. Do you know why insert and extract are absent for float/double in SSE/AVX?
A scalar float/double is just the bottom element of an XMM/YMM register already, and there are various FP shuffle instructions, including vinsertps and vmovlhps, that can (in asm) do the insertion of a 32-bit or 64-bit element. There aren't versions of those that work on 256-bit YMM registers, though, and general 2-register shuffles aren't available until AVX-512, and only with a vector control.

Still, much of the difficulty is in the intrinsics API, making it harder to get at the useful asm operations.
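For 128-bit vectors you can reach those shuffles from intrinsics, although only in the somewhat awkward forms the question mentions. A minimal sketch (my illustration, not from the original answer; function names are made up) of replacing one lane of a __m128 with insertps, and one lane of a __m128d with movsd / unpcklpd:

#include <immintrin.h>

// _mm_insert_ps(a, b, imm8): imm8[7:6] picks the source lane of b,
// imm8[5:4] picks the destination lane in a, imm8[3:0] is a zero mask.
__m128 set_lane2(__m128 v, float x) {
    return _mm_insert_ps(v, _mm_set_ss(x), 0x20);   // lane 0 of the x vector -> lane 2 of v
}

// No insertpd exists, but plain double shuffles cover both positions:
__m128d set_lane0(__m128d v, double x) {
    return _mm_move_sd(v, _mm_set_sd(x));           // movsd: low = x, high = v[1]
}
__m128d set_lane1(__m128d v, double x) {
    return _mm_unpacklo_pd(v, _mm_set_sd(x));       // unpcklpd: low = v[0], high = x
}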
One not-bad way is to broadcast a scalar float or double and blend, partly because a broadcast is one of the ways that intrinsics already provide for getting a __m256d that contains your scalar (see footnote 1).

Immediate-blend instructions can efficiently replace one element of another vector, even in the high half (see footnote 2). They have good throughput, latency, and back-end port distribution on most AVX CPUs. They require SSE4.1, but with AVX they're always available.
(See also Agner Fog's VectorClass Library (VCL) for C++ templates for replacing an element of a vector, at various SSE/AVX feature levels, including with a runtime-variable index, but often designed to optimize down to something good for compile-time constants, e.g. a switch on the index like in Vec4f::insert(). A sketch of that idea follows.)
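A rough sketch of that switch-on-index idea (my illustration, not the actual VCL code) could look like the following; it uses the same broadcast + immediate-blend building block as insert_float in the next section, and when pos is a compile-time constant it should collapse to a single broadcast + vblendps:

#include <immintrin.h>

// Runtime-variable index: each case uses an immediate blend, and constant
// propagation reduces the whole switch to one blend when `pos` is known
// at compile time.
__m256 insert_float_var(__m256 v, float x, int pos) {
    __m256 xv = _mm256_set1_ps(x);          // broadcast the scalar
    switch (pos) {
        case 0: return _mm256_blend_ps(v, xv, 1 << 0);
        case 1: return _mm256_blend_ps(v, xv, 1 << 1);
        case 2: return _mm256_blend_ps(v, xv, 1 << 2);
        case 3: return _mm256_blend_ps(v, xv, 1 << 3);
        case 4: return _mm256_blend_ps(v, xv, 1 << 4);
        case 5: return _mm256_blend_ps(v, xv, 1 << 5);
        case 6: return _mm256_blend_ps(v, xv, 1 << 6);
        case 7: return _mm256_blend_ps(v, xv, 1 << 7);
        default: return v;                  // out-of-range index: leave v unchanged
    }
}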
float into __m256
template <int pos>
__m256 insert_float(__m256 v, float x) {
    __m256 xv = _mm256_set1_ps(x);            // broadcast x to all 8 elements
    return _mm256_blend_ps(v, xv, 1 << pos);  // take element `pos` from xv, the rest from v
}
The best case is with position=0. (Godbolt)
auto test2_merge_0(__m256 v, float x){
return insert_float<0>(v,x);
}
clang notices that the broadcast is redundant and optimizes it away:
test2_merge_0(float __vector(8), float):
vblendps ymm0, ymm0, ymm1, 1 # ymm0 = ymm1[0],ymm0[1,2,3,4,5,6,7]
ret
But clang gets too clever for its own good sometimes, and pessimizes the insert into position 5 to:
test2_merge_5(float __vector(8), float): # clang(trunk) -O3 -march=skylake
vextractf128 xmm2, ymm0, 1
vinsertps xmm1, xmm2, xmm1, 16 # xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
vinsertf128 ymm0, ymm0, xmm1, 1
ret
Or when merging into a zeroed vector, clang uses vxorps-zeroing and then a blend, but GCC does better:
test2_zero_0(float): # GCC(trunk) -O3 -march=skylake
vinsertps xmm0, xmm0, xmm0, 0xe
ret
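The same broadcast + immediate-blend pattern works for the __m256d case the question actually asks about; a minimal sketch (my addition, analogous to insert_float above):

#include <immintrin.h>

// Hypothetical helper: replace element `pos` (0..3) of a __m256d.
// With AVX2 this is typically vbroadcastsd + vblendpd, and the broadcast can
// be optimized away for pos == 0 just like in the float case.
template <int pos>
__m256d insert_double(__m256d v, double x) {
    __m256d xv = _mm256_set1_pd(x);           // broadcast x to all 4 elements
    return _mm256_blend_pd(v, xv, 1 << pos);  // take element `pos` from xv, the rest from v
}

Usage would look like insert_double<2>(v, x) to replace the third element.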
Footnote 1:
Which is a problem for intrinsics; many intrinsics that you could use with a scalar float/double are only available with vector operands, and compilers don't always manage to optimize away _mm_set_ss or _mm_set1_ps or whatever when you only actually read the bottom element. A scalar float/double is either in memory or in the bottom element of an X/YMM register already, so in asm it's 100% free to use vector shuffles on scalar floats/doubles that are already loaded into a register.

But there's no intrinsic to tell the compiler you want a vector with don't-care elements outside the bottom. This means you have to write your source in a way that looks like it's doing extra work, and rely on the compiler to optimize it away. See How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
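To make the limitation concrete (my illustration, not part of the original footnote): the two obvious ways of wrapping a scalar both ask the compiler for more than a following shuffle or blend actually needs:

#include <immintrin.h>

__m128 wrap_zeroed(float x)    { return _mm_set_ss(x); }   // x in lane 0, lanes 1..3 = 0
__m128 wrap_broadcast(float x) { return _mm_set1_ps(x); }  // x in all 4 lanes
// Neither of these means "x in lane 0, other lanes don't-care", so the compiler
// may spend an instruction (an xorps or a broadcast) that a shuffle consuming
// only lane 0 never needed, unless it can prove the extra lanes are dead.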
Footnote 2:
Unlike vpinsrq. As you can see from Godbolt, your version compiles very inefficiently, especially with GCC: both compilers have to handle the high half of the __m256d separately, but GCC finds way fewer optimizations and makes asm that's closer to your very inefficient code. BTW, make the function return a __m256d instead of assigning to a volatile; that way there's less noise. (https://godbolt.org/z/Wrn7n4soh)
_mm256_insert_epi64 is a "compound" intrinsic / helper function: vpinsrq is only available in the vpinsrq xmm, xmm, r/m64, imm8 form, which zero-extends the XMM register into the full Y/ZMM. Even clang's shuffle optimizer (which finds vmovlhps to replace the high half of an XMM with the low half of another XMM) still ends up extracting and re-inserting the high half when you blend into an existing vector instead of a zeroed one.
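Roughly, such a compound helper has to do something like the following for an element in the high half (my sketch of the general shape, not the definition from Intel's headers):

#include <immintrin.h>

// Hypothetical illustration: insert a 64-bit integer into element 3 of a __m256i.
__m256i insert_q3(__m256i v, long long x) {
    __m128i hi = _mm256_extractf128_si256(v, 1);   // pull out the upper 128-bit lane
    hi = _mm_insert_epi64(hi, x, 1);               // vpinsrq into that lane's element 1
    return _mm256_insertf128_si256(v, hi, 1);      // put the modified lane back
}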
The asm situation is that the scalar destination operand for extractps is r/m32, not an XMM register, so it's not useful for extracting a scalar float (except to store it to memory). See my answer on the Q&A Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`? for more about it and about insertps.
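So the usual way to extract a float from an arbitrary position is a shuffle followed by reading the low element; a minimal sketch (my addition, not from the answer):

#include <immintrin.h>

// Hypothetical helper: read element `i` (0..3) of a __m128 as a scalar float.
// Compiles to a single shufps/vpermilps, and compilers can often drop the
// shuffle entirely for i == 0, since _mm_cvtss_f32 is free: the scalar *is*
// the bottom of the register.
template <int i>
float extract_float(__m128 v) {
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(i, i, i, i)));
}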
insertps xmm, xmm/m32, imm can select a source float from another vector register, so the only intrinsic takes two vectors, leaving you with the same problem (How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?) of convincing the compiler not to waste instructions setting elements in a __m128 when you only care about the bottom one.