c++, x86, simd, intrinsics, avx2

_mm256_insert_epi32() has no effect


I started coding for AVX2 on x86 using GCC 12 on Linux. Everything works as expected, except for the following snippet:

#include <iostream>
#include <cstdint>
#include <immintrin.h>

int main()
{
    __m256i aVector = _mm256_setzero_si256();
    _mm256_insert_epi32(aVector, 0x80000000, 0);
    _mm256_insert_epi32(aVector, 0x83333333, 3);
    _mm256_insert_epi32(aVector, 0x87777777, 7);

    alignas(__m256i) uint32_t aArray[8];
    _mm256_store_si256((__m256i*)aArray, aVector);

    std::cout << aArray[0] << ", " << aArray[1] << ", " << aArray[2] << ", "
              << aArray[3] << ", " << aArray[4] << ", " << aArray[5] << ", "
              << aArray[6] << ", " << aArray[7] << std::endl;
}

I expected to see the inserted numbers in the output, but I get the following:

0, 0, 0, 0, 0, 0, 0, 0

I have no clue what is going wrong. I don't get any errors or warnings. A code variant with 64-bit lanes behaves the same way.

Why do the inserts have no effect?


Solution

  • The modified vector is the return value: v = _mm256_insert_epi32(v, x, 3);
    The intrinsics guide has the prototypes; see also the other links in https://stackoverflow.com/tags/sse/info.
    __m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index).
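
    Applied to the snippet from the question, that means assigning each result back to the vector variable:

    aVector = _mm256_insert_epi32(aVector, 0x80000000, 0);
    aVector = _mm256_insert_epi32(aVector, 0x83333333, 3);
    aVector = _mm256_insert_epi32(aVector, 0x87777777, 7);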

    No Intel intrinsic with a lower-case name ever modifies its arg by reference; the lower-case-name ones are (or can be¹) C functions, and C doesn't have reference args. If they have one output, it's the return value. If they have more than one output, there will be a return value and a pointer arg, like _addcarry_u64, which returns the carry and has an unsigned __int64 * output arg. (It often doesn't compile efficiently; the by-value carry return is the problem, with compilers often using setc to materialize the carry into an integer register instead of keeping it in FLAGS.)
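
    As a sketch of that multiple-output pattern (illustrative only; the add128 name is made up here): the carry-out of _addcarry_u64 comes back by value, while each 64-bit sum is written through the pointer arg.

    #include <immintrin.h>

    // Illustrative sketch: add two 128-bit numbers held as lo/hi 64-bit halves.
    // The carry-out is the return value; each sum limb goes out through the pointer arg.
    void add128(unsigned long long a_lo, unsigned long long a_hi,
                unsigned long long b_lo, unsigned long long b_hi,
                unsigned long long *sum_lo, unsigned long long *sum_hi)
    {
        unsigned char carry = _addcarry_u64(0, a_lo, b_lo, sum_lo);
        _addcarry_u64(carry, a_hi, b_hi, sum_hi);
    }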

    There are a few all-caps-named intrinsics which are CPP macros, following the common convention that ALLCAPS names are macros and other names aren't (except maybe as an implementation detail). The most useful is _MM_SHUFFLE(), which stuffs four integers into the 2-bit fields of an immediate for pshufd, shufps, vpermq, etc. And at least a couple of them do modify their args, like _MM_TRANSPOSE4_PS(__m128, __m128, __m128, __m128) (guide).
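
    For example (a sketch, not part of the original answer), _MM_SHUFFLE packs four 2-bit source-element selectors, for result positions 3 down to 0, into the immediate byte that shufps takes:

    // Illustrative sketch: reverse the four floats of a __m128.
    __m128 reverse4(__m128 v) {
        // _MM_SHUFFLE(0, 1, 2, 3) == 0b00'01'10'11: source element 3 goes to result position 0, etc.
        return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    }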


    FYI, this isn't a very efficient way to insert a constant, even for just one element. There's no single instruction for it; vpinsrd only exists with an XMM destination, which zero-extends to 256-bit. (Or legacy-SSE pinsrd, which would leave the upper half unmodified but cause an SSE/AVX transition stall on some microarchitectures. Compilers won't use the legacy-SSE form to insert into the low half even when it would be fast, e.g. with -mtune=skylake or -mtune=znver1.)

    A much faster way to insert three constants is one _mm256_blend_epi32 with a vector that holds the elements you want to insert. vpblendd is a single uop that can run on any vector execution port on Intel, so 3/clock throughput and 1 cycle latency. By contrast, a single vpinsrd is already 2 uops, a shuffle and an integer->xmm transfer, and both of those uops can only run on port 5 on Intel.

    Hopefully clang optimizes the inserts to a blend... Godbolt: close; it did the two in the low 128-bit lane in one blend, but left the high element for a separate blend. It looks like it was trying to save space on constants, but it still ended up using a 32-byte constant with 16 bytes of zeros for the top half.

    __m256i manual_blend(__m256i aVector){
        __m256i vconst = _mm256_set_epi32(0x87777777, 0x86666666, 0x85555555, 0x84444444,
                                          0x83333333, 0x82222222, 0x81111111, 0x80000000);
        return _mm256_blend_epi32(aVector, vconst, 0b1000'1001);
       // take values from the second vector where bits are set in the immediate
       // could write it as (1<<7) | (1<<3) | (1<<0)
    }
    
    # GCC  -O2  -Wall -march=x86-64-v3
    manual_blend(long long __vector(4)):
            vpblendd        ymm0, ymm0, YMMWORD PTR .LC3[rip], 137
            ret
    

    vs. a similar function with 3 inserts, taking a vector arg and returning a modified version (in YMM0).
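
    (The 3-insert source isn't shown here; presumably it looks something like this sketch, using the assign-the-return-value pattern, which matches the constants in the asm below:)

    __m256i bar(__m256i aVector) {
        aVector = _mm256_insert_epi32(aVector, 0x80000000, 0);   // -2147483648
        aVector = _mm256_insert_epi32(aVector, 0x83333333, 3);   // -2093796557
        aVector = _mm256_insert_epi32(aVector, 0x87777777, 7);   // -2022213769
        return aVector;
    }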

    # GCC -O2  -Wall -march=x86-64-v3
    bar(long long __vector(4)):
            mov     eax, -2147483648
            vpinsrd xmm1, xmm0, eax, 0       # insert into the low half, keeping the orig unmodified in YMM0
            mov     eax, -2093796557
            vextracti128    xmm0, ymm0, 0x1  # get the high half of the original
            vpinsrd xmm1, xmm1, eax, 3         # second insert into low half
            mov     eax, -2022213769
            vpinsrd xmm0, xmm0, eax, 3       # insert into the high half
            vinserti128     ymm0, ymm1, xmm0, 0x1   # recombine halves
            ret
    

    GCC does a pretty good job here for naive use of vpinsrd, optimizing across the multiple inserts so it only extracts and re-inserts the high half once, not between each insert.

    Clang, as hoped above, does turn the inserts into blends:

    # clang -O2  -Wall -march=x86-64-v3
    bar(long long __vector(4)):
            vblendps        ymm0, ymm0, ymmword ptr [rip + .LCPI1_0], 9 # ymm0 = mem[0],ymm0[1,2],mem[3],ymm0[4,5,6,7]
            vbroadcastss    ymm1, dword ptr [rip + .LCPI1_1] # ymm1 = [2272753527,2272753527,2272753527,2272753527,2272753527,2272753527,2272753527,2272753527]
            vblendps        ymm0, ymm0, ymm1, 128           # ymm0 = ymm0[0,1,2,3,4,5,6],ymm1[7]
            ret
    

    Clang unfortunately uses FP blends (blendps) even on integer vectors; as part of a dependency chain involving actual SIMD-integer instructions like vpaddd (_mm256_add_epi32), that would cost an extra cycle of bypass latency forwarding into and back out of the blend on some Intel CPUs like Skylake. (The ...ps packed-single encoding is smaller than the equivalent ...pd packed-double or p... integer encoding only for non-AVX SSE1 instructions (movaps vs. movdqa); otherwise they're equal size in machine code. Usually preferring the ...ps form doesn't hurt, so it's fine to do it everywhere; for blends it does hurt and doesn't save space. It also potentially hurts performance for bitwise booleans on some microarchitectures, IIRC, like maybe Sandybridge or Haswell for throughput of vandps vs. vpand.)


    Footnote 1:

    In debug builds, intrinsics with immediate operands need to be macros in GCC's immintrin.h, since even an always_inline function can't rely on constant propagation to make the arg to the GCC __builtin_ia32_... builtin an actual compile-time constant. But in optimized builds, GCC's headers use a function definition; there's an #ifdef and a second set of definitions for the intrinsics that need a constant.
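
    The pattern looks roughly like this (a simplified sketch from memory, not the verbatim GCC header text), using _mm256_blend_epi32 as the example:

    /* Simplified sketch of the header structure, not verbatim GCC source. */
    #ifdef __OPTIMIZE__
    extern __inline __m256i
    __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
    _mm256_blend_epi32 (__m256i __X, __m256i __Y, const int __M)
    {
      return (__m256i) __builtin_ia32_pblendd256 ((__v8si)__X, (__v8si)__Y, __M);
    }
    #else
    #define _mm256_blend_epi32(X, Y, M) \
      ((__m256i) __builtin_ia32_pblendd256 ((__v8si)(__m256i)(X), \
                                            (__v8si)(__m256i)(Y), (int)(M)))
    #endif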