Tags: assembly, optimization, x86-64, micro-optimization, avx512

Filling an AVX512 register with incrementing bytes


Are there any non-obvious tricks to fill an AVX512 register with incrementing bytes (little-endian)? That is, the equivalent of this code:

#include <immintrin.h>
#include <stdalign.h>   /* alignas, in C11 */

__m512i make_incrementing_bytes(void) {
    /* Compiler optimizes this into an initialized array in .rodata. */
    alignas(64) char data[sizeof(__m512i)];
    for (unsigned i = 0; i < sizeof(data); i++) {
        data[i] = i;
    }
    return _mm512_load_si512(data);
}

The only obvious approach I see (and the one GCC produces for the above code) is the generic one: a vmovdqa64 load from memory. But this constant is low-entropy enough that it seems one ought to be able to do better, somehow.

(I know that constant loads typically aren't on the critical path, or that you have a spare register to dedicate to the constant so you can reload it cheaply, but I'm interested in whether there are any tricks buried in this instruction set. For an instruction set with a full-width register multiply, for instance, you can fill every byte with 0x01, square the register, and left-shift the result by one byte - but as far as I can tell that isn't suited to AVX512.)
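
(For instance, at 32-bit width: 0x01010101 squared is 0x0001020304030201, and shifting that left by one byte gives 0x0102030403020100, whose low four bytes are 00 01 02 03 in memory order. With a genuinely full-width multiply the pattern continues across the whole register, since product byte k is the number of ways to write k as a sum of two byte positions, i.e. k+1 before the shift.)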


Solution

  • I don't think there's any very efficient way to generate a sequence like that on the fly where different elements have different values. 64 different byte values is pretty high entropy if you can't take advantage of the similarity to previous elements.

    It's only easy to broadcast 4-byte or 8-byte patterns (from a mov-immediate to an integer register), or 4, 8, 16, or 32-byte patterns from memory. Or, with vpmovzxbd for example, you can "compress" the storage of shuffle constants that have wider elements (word, dword, or qword), at the cost of an extra shuffle uop when you load them. Or you can generate a vector where every element has the same value on the fly, starting from all-ones. But unless you're writing asm by hand, compilers will constant-propagate through intrinsics, so you're at their mercy: some are smart enough to use broadcast loads instead of expanding _mm512_set1_epi32(0x03020100) into 64 bytes, but not always.
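
    For example (a sketch; the label and register names are mine), a dword shuffle constant for something like vpermd can be stored as 16 bytes and widened as it's loaded:

      vpmovzxbd  zmm7, [dword_indices_bytes]   ; 16 bytes -> 16 dword indices, one extra shuffle uop

      section .rodata
      dword_indices_bytes: db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15   ; 16 bytes instead of 64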

    There aren't instructions that do something different to each element, and the multiply trick is limited to working within 64-bit chunks.

    Interesting trick with squaring 0x01010101; that could be a good starting point, except you might as well start directly with mov eax, 0x03020100 / vpbroadcastd xmm0, eax (or ZMM), or vmovd xmm0, eax, or 64-bit mov rax, 0x0706050403020100 (10 bytes) / vpbroadcastq zmm0, rax (6 bytes). Those are cheaper than vpternlogd zmm0,zmm0,zmm0, -1 / vpabsb zmm0, zmm0 (to get set1_epi8(1)) followed by vpmullq zmm0,zmm0,zmm0 / vpsllq zmm0, zmm0, 8.
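
    Counting the bytes by hand (untested sketch), the cheap starting point is:

      mov          rax, 0x0706050403020100   ; 10 bytes
      vpbroadcastq zmm0, rax                 ;  6 bytes: each qword = bytes 0..7

    versus generating the same value with the multiply trick:

      vpternlogd   zmm0, zmm0, zmm0, 0xFF    ; 7 bytes: all-ones
      vpabsb       zmm0, zmm0                ; 6 bytes: set1_epi8(1)
      vpmullq      zmm0, zmm0, zmm0          ; 6 bytes: each qword = 0x0807060504030201
      vpsllq       zmm0, zmm0, 8             ; 7 bytes: each qword = 0x0706050403020100

    That's 16 bytes and 2 instructions versus 26 bytes and 4 instructions (one of which, vpmullq, is itself 2 uops on Intel), and either way every qword holds the same 0..7.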

    There's not even a widening 64 => 128-bit multiply, although AVX-512 does have vpmullq, which AVX2 doesn't. However, it's 2 uops on Intel CPUs (one on Zen 4).

    Each AVX-512 instruction is at least 6 bytes (4-byte EVEX + opcode + modrm), so that adds up quickly if you're optimizing for pure size of .text+.rodata (which might not be unreasonable outside a loop). You still wouldn't want an actual loop that stored 4 bytes at a time for 16 iterations, like add eax, 0x04040404 / stosd, that would be slower than you want even outside a loop.
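
    Spelled out (a hypothetical sketch, just to show why it loses), that scalar loop would be something like:

      lea       rdi, [rsp-64]        ; scratch space in the red zone
      mov       eax, 0x03020100
      mov       ecx, 16
    fill_loop:
      stosd                          ; store EAX to [rdi], rdi += 4
      add       eax, 0x04040404      ; next four byte values
      dec       ecx
      jnz       fill_loop
      vmovdqu64 zmm0, [rsp-64]       ; 16 narrow stores can't store-forward to one wide reload

    Small, but 16 dependent iterations plus a store-forwarding stall on the reload.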


    Starting with set1_epi32(0x03020100) or a 64-bit or 128-bit version would still need multiple shuffle and add steps to widen up to 512-bit, with the right amount of 0x04, 0x08, or 0x10 added to each part of the broadcast result.

    I can't think of anything better, and it's still not good enough to use. Using some AVX2 instructions saves code size vs. ZMM all the way, unless I'm missing a way to save an instruction.

    The strategy is to create [ 0x30 repeating | 0x20 repeating | 0x10 repeating | 0x00 repeating] in a ZMM and add it to a broadcast 16-byte pattern.

    default rel
      vpbroadcastd     ymm1, [vec4_0x10]   ; we're loading another constant anyway, this is cheaper
      vpaddd           ymm2, ymm1,ymm1     ; set1(0x20)
      vmovdqa          xmm3, xmm1          ; [ set1(0)   , set1(0x10) ]     ; mov-elimination
      vpaddd           ymm4, ymm3, ymm2    ; [ set1(0x20), set1(0x30) ]
      vshufi32x4       zmm4, zmm3, zmm4, 0b00_01_00_01    ; _MM_SHUFFLE(0,1,0,1) works like shufps but in 16-byte chunks.
      vbroadcasti64x2  zmm0, [vec16_0to15]
      vpaddb           zmm0, zmm0, zmm4     ; memory-source broadcast only available with element size, e.g. vpaddq z,z,m64{1to8} but that'd take more granular shuffling
    
    section .rodata
    align 16
      vec16_0to15: db 0,1,2,3,4,5,6,7
                  db 8,9,10,11,12,13,14,15
    
      vec4_0x10: dd 0x10101010
    

    Size: machine code: 0x2c bytes. Constants: 16 + 4 = 0x14.
    Total: 0x40 = 64 bytes, the same as putting the whole literal constant in memory.

    Masking might have saved vector instructions, at the cost of needing to set up mask-register values which costs mov eax, imm32 / kmov k1, eax.
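
    For example (untested sketch, reusing the 32-byte constants from the version below): broadcast 0..31 to both halves, then use a merge-masked OR to set the 0x20 bit only in the upper 16 dwords:

      mov              eax, 0xFF00          ; dword elements 8..15 = the upper 256 bits
      kmovw            k1, eax
      vbroadcasti64x4  zmm0, [vec_0to31]    ; 0..31 in both halves
      vpord            zmm0{k1}, zmm0, [mask_0x20]{1to16}   ; +0x20 in the upper half only

    That's four instructions instead of three, so the mask setup doesn't pay for itself here.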

    A less-bad tradeoff of instruction (uop) count vs. total size could be to start with a 32-byte 0..31 constant, so you just need to set one bit (0x20) in each byte of the upper half after broadcasting.

    ;; update: this is a better tradeoff, 61 total bytes and far fewer instructions
    ;; 25 bytes of machine code in 3 instructions
    default rel
       vmovdqa ymm0, [vec_0to31]                ;  0..31
       vpord   ymm1, ymm0, [mask_0x20]{1to8}    ; 32..63
       vinserti32x8 zmm0, zmm0, ymm1, 1
    
    section .rodata
    ;; 36 bytes of data
    align 32
    vec_0to31: db 0..31    ; see below for a way to actually write this in NASM
    mask_0x20: dd 0x20202020
    

    So the 16-byte-chunk way only saves about 10 bytes vs. putting the whole literal constant in memory: the size of the ZMM load with a RIP-relative addressing mode that you'd need to get it into a register from .rodata. Or it saves just 4 bytes, the size of a RIP-relative addressing mode, if you'd otherwise use the constant as a memory source operand: the difference between vpaddb zmm0, zmm0, zmm31 and vpaddb zmm0, zmm0, [vector_const], depending what you're doing with it.

    $ objdump -drwC -Mintel foo
    0000000000401000 <_start>:
      401000:       c4 e2 7d 58 0d 07 10 00 00      vpbroadcastd ymm1,DWORD PTR [rip+0x1007]        # 402010 <vec4_0x10>
      401009:       c5 f5 fe d1             vpaddd ymm2,ymm1,ymm1
      40100d:       c5 f9 6f d9             vmovdqa xmm3,xmm1
      401011:       c5 e5 fe e2             vpaddd ymm4,ymm3,ymm2
      401015:       62 f3 65 48 43 e4 11    vshufi32x4 zmm4,zmm3,zmm4,0x11
      40101c:       62 f2 fd 48 5a 05 da 0f 00 00   vbroadcasti64x2 zmm0,XMMWORD PTR [rip+0xfda]        # 402000 <vec16_0to15>
      401026:       62 f1 7d 48 fc c4       vpaddb zmm0,zmm0,zmm4
    
    $ size foo
       text    data     bss     dec     hex filename
         64       0       0      64      40 foo
    

    I did confirm this works with GDB attached to SDE:

    # stopped before the last   vpaddb
    (gdb) p /x $zmm0.v64_int8 
    $2 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0,
      0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf}
    (gdb) p /x $zmm4.v64_int8
    $3 = {0x0 <repeats 16 times>, 0x10 <repeats 16 times>, 0x20 <repeats 16 times>, 0x30 <repeats 16 times>}
    
    (gdb) si
    0x000000000040102c in ?? ()
    (gdb) p /x $zmm0.v64_int8 
    $4 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d,
      0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39,
      0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f}
    

    If you were considering doing something like this, use the version that starts with a 32-byte constant, since 3 instructions is not unreasonable and it's less total size. (That's assuming you don't lose space to padding from aligning the constant, or to padding after the 4-byte constant. You could leave it unaligned, especially if you know it doesn't cross a cache-line boundary; the extra latency from a cache-line split might stop out-of-order exec from getting started on the work that uses the constant, so that's undesirable.)

    @chtz suggests another alternative in comments:

    You can create a {0,...,0,8,...,16,...24,...} vector using vpmovzxbq from {0,1,2,...,7} combined with a vpmultishiftqb with a broadcasted -3. Then add a broadcasted 0x0001020304050607 (can use the same memory as the vpmovzxbq).

    I haven't tested this, but it could be interesting, especially if you want to use only immediates, no loads from .rodata. mov rax, 0x0706050403020100 / vpbroadcastq zmm0, rax / vpmovzxbq zmm1, xmm0 gives you the two constants based on that. With memory sources you could use vporq or vpaddq with a [mem]{1to8} instead of a separate vpbroadcastq. The multishift control needs -3 in every byte, so getting it might just be mov eax, -3 / vpbroadcastb zmm2, eax (a byte broadcast; a qword broadcast of -3 would only put -3 in the low byte of each qword). Still 2 instructions, but one of them scalar integer, not competing for vector execution units.
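
    Putting that together (untested sketch, register choices mine), with no loads from .rodata at all:

      mov            rax, 0x0706050403020100
      vpbroadcastq   zmm0, rax           ; every qword = bytes 0..7
      vpmovzxbq      zmm1, xmm0          ; qword i = i, zero-extending the low 8 bytes
      mov            eax, -3
      vpbroadcastb   zmm2, eax           ; set1_epi8(-3): bit-offset 61 in every control byte
      vpmultishiftqb zmm1, zmm2, zmm1    ; each byte of qword i = ROR(i, 61) = i<<3 = 8*i
      vpaddb         zmm0, zmm0, zmm1    ; 8*i + {0..7} in qword i = bytes 0..63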


    Writing 0..n constants in asm source

    I just don't know a way to write the constant to load in assembly source that is both concise and clear

    (val: .rept 64 / .byte .-val / .endr satisfies the former but not the latter, for instance.)

    That's a neat use of GAS syntax (although of course ; is the statement separator if you want to actually put it all on one line.) Seems like a comment on it would be sufficient.
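
    On one line, that would be (sketch):

      val: .rept 64; .byte .-val; .endr    # value = offset from val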

    In NASM syntax, %assign inside %rep 64 would be the natural way, as shown in the NASM manual's example of using %rep for unrolling a loop. In this case,

    align 64
    vec64_0to63:        ; self-explanatory name for the constant points readers in the right direction
      %assign i 0 
      %rep    64 
        db  i
        %assign i i+1 
      %endrep
    

    Something equivalent is possible in GAS with .set.
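
    For instance (untested), using .set the same way %assign is used above:

      vec64_0to63:
        .set i, 0
        .rept 64
          .byte i
          .set i, i+1
        .endr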

    %xdefine would be usable, too, although that would make the assembler evaluate a growing 0+1+1+1+1+... text string every time.

    Alternatively, your idea in NASM syntax looks like this, where a comment and the label name remind readers how it works. I actually prefer this to the %assign version; there's less going on to keep track of.

    vec64_0to63:
    %rep 64
        db $-vec64_0to63       ; 0..63  value = offset from the label
    %endrep
    

    Doing it all on one line with times doesn't work: v2: times 16 db $-v2 fills with zeros, because $-v2 is evaluated to a constant zero before being repeated.