Tags: c++, floating-point, sse, simd, avx

No insert and extract for float/double in SSE and AVX?


I just noticed the absence of _mm256_insert_pd() / _mm256_insert_ps() / _mm_insert_pd(); _mm_insert_ps() does exist, but with a somewhat weird usage pattern.

Meanwhile _mm_insert_epi32(), _mm256_insert_epi32(), and the other integer variants do exist.

Is it an intentional decision by Intel not to implement the float/double variants for some reason? And what is the correct and most performant way to set a single float/double at a given position (not only the 0th) of an SSE/AVX register?

I implemented the following AVX double variant of insert, which works, but maybe there is a better way to do this:

Try it online!

#include <immintrin.h>
#include <cstdint>
#include <cstring>

template <int I>
__m256d _mm256_insert_pd(__m256d a, double x) {
    // Bit-cast the double to int64_t, then reuse the integer insert intrinsic.
    int64_t ix;
    std::memcpy(&ix, &x, sizeof(x));
    return _mm256_castsi256_pd(
        _mm256_insert_epi64(_mm256_castpd_si256(a), ix, I)
    );
}
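
For example, a quick usage sketch (the values here are just for illustration):

__m256d v = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
v = _mm256_insert_pd<2>(v, 42.0);   // v should now hold {1.0, 2.0, 42.0, 4.0}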

As far as I can see, the extract variants for float/double are also absent in SSE/AVX for some reason. I know only _mm_extract_ps() exists, but not the others.

Do you know why insert and extract are absent for float/double in SSE/AVX?


Solution

  • A scalar float/double is just the bottom element of an XMM/YMM register already, and there are various FP shuffle instructions including vinsertps and vmovlhps that can (in asm) do the insertion of a 32-bit or 64-bit element. There aren't versions of those which work on 256-bit YMM registers, though, and general 2-register shuffles aren't available until AVX-512, and only with a vector control.

    Still, much of the difficulty is in the intrinsics API, which makes it harder to get at the useful asm operations.


    One not-bad way is to broadcast a scalar float or double and blend, partly because a broadcast is one of the ways that intrinsics already provide for getting a __m256d that contains your scalar (footnote 1).

    Immediate-blend instructions can efficiently replace one element of another vector, even in the high half (footnote 2). They have good throughput, latency, and back-end port distribution on most AVX CPUs. They require SSE4.1, but with AVX they're always available.

    (See also Agner Fog's VectorClass Library (VCL) for C++ templates for replacing an element of a vector at various SSE / AVX feature levels, including with a runtime-variable index, but often designed to optimize down to something good for compile-time constants, e.g. a switch on the index as in Vec4f::insert().)
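
    For example, a minimal sketch (mine, not VCL code) of that switch-on-index pattern, dispatching a runtime position to immediate blends:

    __m256 insert_float_runtime(__m256 v, float x, int pos) {
        __m256 xv = _mm256_set1_ps(x);     // broadcast the scalar
        switch (pos & 7) {                 // runtime index -> compile-time immediate
            case 0: return _mm256_blend_ps(v, xv, 1 << 0);
            case 1: return _mm256_blend_ps(v, xv, 1 << 1);
            case 2: return _mm256_blend_ps(v, xv, 1 << 2);
            case 3: return _mm256_blend_ps(v, xv, 1 << 3);
            case 4: return _mm256_blend_ps(v, xv, 1 << 4);
            case 5: return _mm256_blend_ps(v, xv, 1 << 5);
            case 6: return _mm256_blend_ps(v, xv, 1 << 6);
            default: return _mm256_blend_ps(v, xv, 1 << 7);
        }
    }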


    float into __m256

    template <int pos>
    __m256 insert_float(__m256 v, float x) {
        __m256 xv = _mm256_set1_ps(x);
        return _mm256_blend_ps(v, xv, 1<<pos);
    }
    

    The best case is with position=0. (Godbolt)

    auto test2_merge_0(__m256 v, float x){
        return insert_float<0>(v,x);
    }
    

    clang notices that the broadcast is redundant and optimizes it away:

    test2_merge_0(float __vector(8), float):
            vblendps        ymm0, ymm0, ymm1, 1             # ymm0 = ymm1[0],ymm0[1,2,3,4,5,6,7]
            ret
    

    But clang gets too clever for its own good sometimes, and for a position in the high half (e.g. 5) it pessimizes this to

    test2_merge_5(float __vector(8), float):  # clang(trunk) -O3 -march=skylake
            vextractf128    xmm2, ymm0, 1
            vinsertps       xmm1, xmm2, xmm1, 16    # xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
            vinsertf128     ymm0, ymm0, xmm1, 1
            ret
    

    Or when merging into a zeroed vector, clang uses vxorps-zeroing and then a blend, but gcc does better:

    test2_zero_0(float):           # GCC(trunk) -O3 -march=skylake
            vinsertps       xmm0, xmm0, xmm0, 0xe
            ret
    
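
    double into __m256d

    For the double case asked about, the analogous broadcast-and-blend approach (a minimal sketch along the same lines, not lifted from anywhere) would be:

    template <int pos>
    __m256d insert_double(__m256d v, double x) {
        __m256d xv = _mm256_set1_pd(x);           // broadcast the scalar
        return _mm256_blend_pd(v, xv, 1 << pos);  // replace element `pos`
    }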

    Footnote 1:
    Which is a problem for intrinsics; many intrinsics that you could use with a scalar float/double are only available with vector operands, and compilers don't always manage to optimize away _mm_set_ss or _mm_set1_ps or whatever when you only actually read the bottom element. A scalar float/double is either in memory or the bottom element of an X/YMM register already, so in asm it's 100% free to use vector shuffles on scalar floats / doubles that are already loaded into a register.

    But there's no intrinsic to tell the compiler you want a vector with don't-care elements outside the bottom. This means you have to write your source in a way that looks like it's doing extra work, and rely on the compiler to optimize it away. See How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?

    Footnote 2:
    Unlike vpinsrq. As you can see from Godbolt, your version compiles very inefficiently, especially with GCC. Both compilers have to handle the high half of the __m256d separately, but GCC finds far fewer optimizations and makes asm that's closer to a literal translation of your inefficient code. BTW, make the function return a __m256d instead of assigning to a volatile; that way you have less noise. (https://godbolt.org/z/Wrn7n4soh)

    _mm256_insert_epi64 is a "compound" intrinsic / helper function: vpinsrq is only available in vpinsrq xmm, xmm, r/m64, imm8 form, which zero-extends the xmm register into the full Y/ZMM. Even clang's shuffle optimizer (which finds vmovlhps to replace the high half of an XMM with the low half of another XMM) still ends up extracting and re-inserting the high half when you blend into an existing vector instead of zero.
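
    To illustrate (my own sketch, not the answer's or the compilers' exact output), this is roughly what has to happen for a high-half position, since vpinsrq only operates on an XMM register:

    // Assumes the same headers as the question's snippet (<immintrin.h>, <cstdint>, <cstring>).
    template <int I>
    __m256d insert_pd_high(__m256d a, double x) {
        static_assert(I == 2 || I == 3, "high-lane positions only");
        __m128d hi = _mm256_extractf128_pd(a, 1);   // pull out the upper 128-bit lane
        int64_t ix;
        std::memcpy(&ix, &x, sizeof(x));            // bit-cast double -> int64_t
        __m128i hi_i = _mm_insert_epi64(_mm_castpd_si128(hi), ix, I - 2);
        return _mm256_insertf128_pd(a, _mm_castsi128_pd(hi_i), 1);  // put the lane back
    }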


    The asm situation is that the destination operand for extractps is r/m32, not an XMM register, so it's not useful for extracting a scalar float into an XMM register (it's fine for storing one to memory). See my answer on the Q&A Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`? for more about it and insertps.
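
    A workaround sketch (mine): shuffle the wanted element down to the bottom and read it with _mm_cvtss_f32, which is free in asm:

    // Extract element 2 of a __m128 as a float: the shuffle immediate picks which
    // element ends up in the bottom slot; _mm_cvtss_f32 just reinterprets it.
    float extract_float2(__m128 v) {
        __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        return _mm_cvtss_f32(t);
    }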

    insertps xmm, xmm/m32, imm can select a source float from another vector register, so the only intrinsic takes two vectors. That leaves you with the problem from the Q&A linked above (How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements?): convincing the compiler not to waste instructions setting elements in a __m128 when you only care about the bottom one.
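
    As a sketch of the "weird" _mm_insert_ps usage the question mentions (my own example): bits 7:6 of the immediate pick the source element in the second operand, bits 5:4 pick the destination element in the first, and bits 3:0 are a zero mask.

    // Copy element 0 of xv (which holds x) into element 2 of v; zero mask = 0.
    // Note that _mm_set_ss may cost an instruction to zero the upper elements,
    // which is exactly the "wasted work" problem described above.
    __m128 insert_float_sse41(__m128 v, float x) {
        __m128 xv = _mm_set_ss(x);                        // x in the bottom element
        return _mm_insert_ps(v, xv, (0 << 6) | (2 << 4) | 0);
    }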