Search code examples
assemblyx86ssesimdavx

What are the best instruction sequences to generate vector constants on the fly?


"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count.

Constant-generation is by its very nature the start of a fresh dependency chain, so it's unusual for latency to matter. It's also unusual to generate constants inside a loop, so throughput and execution-port demands are also mostly irrelevant.

Generating constants instead of loading them takes more instructions (except for all-zero or all-one), so it does consume precious uop-cache space. This can be an even more limited resource than data cache.

Agner Fog's excellent Optimizing Assembly guide covers this in Section 13.8. Table 13.9 has sequences for generating vectors where every element is 0, 1, 2, 3, 4, -1, or -2, with element sizes from 8 to 64 bits. Table 13.11 has sequences for generating some floating point values (0.0, 0.5, 1.0, 1.5, 2.0, -2.0, and bitmasks for the sign bit.)

Agner Fog's sequences only use SSE2, either by design or because it hasn't been updated for a while.

What other constants can be generated with short non-obvious sequences of instructions? (Further extensions with different shift counts are obvious and not "interesting".) Are there better sequences for generating the constants Agner Fog does list?

How to move 128-bit immediates to XMM registers illustrates some ways to put an arbitrary 128b constant into the instruction stream, but that's usually not sensible (it doesn't save any space, and takes up lots of uop-cache space.)


Solution

  • All-zero: pxor xmm0,xmm0 (or xorps xmm0,xmm0, one instruction-byte shorter.) There isn't much difference on modern CPUs, but on Nehalem (before xor-zero elimination), the xorps uop could only run on port 5. I think that's why compilers favour pxor-zeroing even for registers that will be used with FP instructions. xor-zeroing is as cheap as a NOP on modern CPUs.

    All-ones: pcmpeqw xmm0,xmm0. This is the usual starting point for generating other constants, because (like pxor) it breaks the dependency on the previous value of the register (except on old CPUs like K10 and pre-Core2 P6). It does need an execution port to write ones to a physical register, unlike zeroing on some CPUs, but either way it's one uop for the front-end.

    There's no advantage to the W version over the byte or dword element size versions of pcmpeq on any CPU in Agner Fog's instruction tables, but pcmpeqQ takes an extra byte, is slower on Silvermont, and requires SSE4.1.


    The main difficulty is 8-bit vectors, because there's no PSLLB

    Update: with GFNI gf2p8affineqb, any set1_epi8 can be created from a zeroed or all-ones register. It can rearrange bits within bytes or gather bits within a qword, but we don't need any of that, we just care about the immediate byte being XORed with every byte of the vector as the final step.

    ; or vpcmpeqd  ymm0, ymm0, ymm0 ; or xmm, depending what other constants you want
        vpxor           xmm0, xmm0, xmm0        ; 4 bytes
        gf2p8affineqb   ymm1, ymm0, ymm0, 0x5a  ; 6 bytes
     ; ymm1 = _mm256_set1_epi8(0x5a)
    

    All-zeroes and all-ones both work (I tested with SDE), but arbitrary garbage doesn't affine-transform itself to zero. If you have a known repeating constant, you could use it as a source operand with your immediate being the XOR of what you want with what gf2p8affineqb produces with an immediate of 0. (But keep in mind that data loads can miss in cache; an extra xor-zeroing instruction lets out-of-order exec get started on everything independent of the other constant.)

    GFNI + AVX is available on Ice Lake / Zen 4 and later. Silvermont-family supported the SSE form of GFNI, so unlike AVX-512, Alder Lake and similar non-server CPUs do still support AVX + GFNI. On CPUs that support it, it's single-uop with 5 (Intel) or 3 cycle latency (AMD). Or 4-cycle on Intel E-cores (for the 128-bit version). https://uops.info/ Throughput is 2/clock for 256-bit, 1/clock for 512-bit on both Intel and AMD. (Except Intel E-cores, only 128-bit is 2/clock there.) So this is still fine for producing constants ahead of a loop; out-of-order exec can be working on this while vector data is loading.


    With SSSE3 for pabsb, and other legacy-SSE tricks

    Agner Fog's table generates vectors of 16-bit elements and uses packuswb to work around this. For example, pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1 / packuswb xmm0,xmm0 generates a vector where every byte is 2. (This pattern of shifts, with different counts, is the main way to produce most constants for wider vectors). There is a better way:

    paddb xmm0,xmm0 (SSE2) works as a left-shift by one with byte granularity, so a vector of -2 bytes can be generated with only two instructions (pcmpeqw / paddb). paddw/d/q as a left-shift-by-one for other element sizes saves one byte of machine code compared to shifts, and can generally run on more ports than a shift-imm.

    pabsb xmm0,xmm0 (SSSE3) turns a vector of all-ones (-1) into a vector of 1 bytes, and is non-destructive so you still have the set1(-1) vector.

    (You sometimes don't need set1(1). You can add 1 to every element by subtracting -1 with psubb instead.)

    We can generate 2 bytes with pcmpeqw / paddb / pabsb. (Order of add vs. abs doesn't matter). pabs doesn't need an imm8, but only saves code bytes for other element widths vs. right shifting when both require a 3-byte VEX prefix. This only happens when the source register is xmm8-15. (vpabsb/w/d always requires a 3-byte VEX prefix for VEX.128.66.0F38.WIG, but vpsrlw dest,src,imm can otherwise use a 2-byte VEX prefix for its VEX.NDD.128.66.0F.WIG).

    We can actually save instructions in generating power-of-2 bytes like 4, too: pcmpeqw / pabsb / psllw xmm0, 2. All the bits that are shifted across byte boundaries by the word-shift are zero, thanks to pabsb. Obviously other shift counts can put the single set-bit at other locations, including the sign bit to generate a vector of -128 (0x80) bytes. Note that pabsb is non-destructive (the destination operand is write-only, and doesn't need to be the same as the source to get the desired behaviour). You can keep the all-ones around as a constant, or as the start of generating another constant, or as a source operand for psubb (to increment by one).

    A vector of 0x80 bytes can be also (see prev paragraph) be generated from anything that saturates to -128, using packsswb. e.g. if you already have a vector of 0xFF00 for something else, just copy it and use packsswb. Constants loaded from memory that happen to saturate correctly are potential targets for this.

    A vector of 0x7f bytes can be generated with pcmpeqw / paddb xmm0,xmm0 / psrlw xmm0, 1. This is slightly better than pcmpeqw / psrlw xmm0, 9 / packuswb xmm0,xmm0, the usual trick of generating the value in each word and using packuswb. But PADDB can run on more ports than PACK on most CPUs.

    pavgb (SSE2) against a zeroed register can right-shift by one, but only if the value is even. (It does unsigned dst = (dst+src+1)>>1 for rounding, with 9-bit internal precision for the temporary.) This doesn't seem to be useful for constant-generation, though, because 0xff is odd: pxor xmm1,xmm1 / pcmpeqw xmm0,xmm0 / paddb xmm0,xmm0 / pavgb xmm0, xmm1 produces 0x7f bytes with one more insn than shift/pack. If a zeroed register is already needed for something else, though, paddb / pavgb does save one instruction byte.


    I have tested these sequences. The easiest way is to throw them in a .asm, assemble/link, and run gdb on it. layout asm, display /x $xmm0.v16_int8 to dump that after every single-step, and single-step instructions (ni or si). In layout reg mode, you can do tui reg vec to switch to a display of vector regs, but it's nearly useless because you can't select which interpretation to display (you always get all of them, and can't hscroll, and the columns don't line up between registers). It's excellent for integer regs/flags, though.


    Note that using these with intrinsics can be tricky. Compilers don't like to operate on uninitialized variables, so you should use _mm_undefined_si128() to tell the compiler that's what you meant. Or perhaps using _mm_set1_epi32(-1) will get your compiler to emit a pcmpeqd same,same. Without this, some compilers will xor-zero uninitialized vector variables before use, or even (MSVC) load uninitialized memory from the stack.


    Compacting memory constants with broadcast or widening loads

    Many constants can be stored more compactly in memory by taking advantage of SSE4.1's pmovzx or pmovsx for zero or sign-extension on the fly. For example, a 128b vector of {1, 2, 3, 4} as 32bit elements could be generated with a pmovzx load from a 32bit memory location. Memory operands can micro-fuse with pmovzx, so it doesn't take any extra fused-domain uops. It does prevent using the constant directly as a memory operand, though.

    C/C++ intrinsics support for using pmovz/sx as a load is terrible: there's _mm_cvtepu8_epi32 (__m128i a), but no version that takes a uint32_t * pointer operand. You can hack around it, but it's ugly and compiler optimization failure is a problem. See the linked question for details and links to the gcc bug reports.

    With 256b and (not so) soon 512b constants, the savings in memory are larger. This only matters very much if multiple useful constants can share a cache-line, though.

    The FP equivalent of this is VCVTPH2PS xmm1, xmm2/m64, requiring the F16C (half precision) feature flag. (There's also a store instruction that packs single to half, but no computation at half precision. It's a memory bandwidth / cache footprint optimization only.)


    Obviously when all elements are the same (but not suitable for generating on the fly), pshufd or AVX vbroadcastps / AVX2 vpbroadcastb/w/d/q/i128 are useful. pshufd can take a memory source operand, but it has to be 128b. movddup (SSE3) does a 64bit load, broadcast to fill a 128b register. On Intel, it doesn't need an ALU execution unit, only load port. (Similarly, AVX v[p]broadcast loads of dword size and larger are handled in the load unit, without ALU).

    Broadcasts or pmovz/sx are excellent for saving executable size when you're going to load a mask into a register for repeated use in a loop. Generating multiple similar masks from one starting point can also save space, if it only takes one instruction.

    See also For for an SSE vector that has all the same components, generate on the fly or precompute? which is asking more about using the set1 intrinsic, and it isn't clear if it's asking about constants or broadcasts of variables.

    I also experimented some with compiler output for broadcasts.


    If cache misses are a problem, take a look at your code and see if the compiler has duplicated _mm_set constants when the same function is inlined into different callers. Also watch out for constants that are used together (e.g. in functions called one after another) being scattered into different cache lines. Many scattered loads for constants is far worse than loading a lot of constants all from near each other.

    pmovzx and/or broadcast loads let you pack more constants into a cache line, with very low overhead for loading them into a register. The load won't be on the critical path, so even if it takes an extra uop, it can take a free execution unit at any cycle over a long window.

    clang actually does a good job of this: separate set1 constants in different functions are recognized as identical, the way identical string literals can be merged. Note that clang's asm source output appears to show each function having its own copy of the constant, but the binary disassembly shows that all those RIP-relative effective addresses are referencing the same location. For 256b versions of the repeated functions, clang also uses vbroadcastsd to only require an 8B load, at the expense of an extra instruction in each function. (This is at -O3, so clearly the clang devs have realized that size matters for performance, not just for -Os). IDK why it doesn't go down to a 4B constant with vbroadcastss, because that should be just as fast. Unfortunately, the vbroadcast don't simply come from part of the 16B constant the other functions used. This maybe makes sense: an AVX version of something could probably only merge some of its constants with an SSE version. It's better to leave the memory pages with SSE constants completely cold, and have the AVX version keep all its constants together. Also, it's a harder pattern-matching problem to be handled at assemble or link time (however it's done. I didn't read every directive to figure out which one enables the merging.)

    gcc 5.3 also merges constants, but doesn't use broadcast-loads to compress 32B constants. Again the 16B constant doesn't overlap with the 32B constant.


    GCC12 has started preferring constructing some vector constants on the fly if AVX is available, but starting with mov reg, imm64 / vmovq / shuffle. Even if the pattern is simple and repeated, it will use a bulky 64-bit immediate and a vpunpcklqdq instead of a 5-byte mov eax, imm32 and dword broadcast. Also ironically not constructing on the fly for a set1_epi16(0x00ff) which is trivial (pcmpeqd / psrlw xmm, 8). https://godbolt.org/z/78cMaxjMz

    With AVX-512 available, mov r,imm / vpbroadcastd x/y/zmm, eax is only 2 instructions. (The vpbroadcastd ymm0, eax is 1 uop on Intel, 2 uops on Zen 4). That makes it more attractive to construct vectors that way, and easier for compilers to come up with.