Are there any non-obvious tricks to fill an AVX512 register with incrementing bytes (little-endian)? That is, the equivalent of this code:
#include <immintrin.h>
#include <stdalign.h>

__m512i make_incrementing_bytes(void) {
    /* Compiler optimizes this into an initialized array in .rodata. */
    alignas(64) char data[sizeof(__m512i)];
    for (unsigned i = 0; i < sizeof(data); i++) {
        data[i] = i;
    }
    return _mm512_load_si512(data);
}
The only obvious approach I see (and the one that GCC produces with the above code) is to just take the generic approach of using a vmovdqa64
from memory - but this constant is low-entropy enough that it seems like one ought to be able to do better, somehow.
(I know that constant loads typically aren't on the critical path, or you have a spare register to dedicate to the constant so you can just reload it, but I'm interested in whether there are any tricks buried in this instruction set. For an instruction set with a full-width register multiply, for instance, you could fill every byte with 0x01, square the register, and left-shift the result by one byte - but that isn't suited to AVX512 so far as I can tell.)
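(For concreteness, here's that multiply trick for a plain 64-bit general-purpose register, the widest full multiply x86 actually has - just a scalar illustration, not an AVX512 answer:)
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 0x0101010101010101u;  /* every byte = 0x01 */
    uint64_t sq = x * x;               /* low 64 bits of the square: 0x0807060504030201 */
    uint64_t inc = sq << 8;            /* shift by one byte: 0x0706050403020100 = bytes 0..7 */
    printf("%016llx\n", (unsigned long long)inc);  /* prints 0706050403020100 */
}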
I don't think there's any very efficient way to generate a sequence like that on the fly where different elements have different values. 64 different byte values is pretty high entropy if you can't take advantage of the similarity to previous elements.
It's only easy to broadcast 4-byte or 8-byte patterns (from mov-immediate to an integer register), or 4, 8, 16, or 32-byte patterns from memory. Or with vpmovzxbd
for example, "compress" the storage of shuffle constants with wider elements (word, dword or qword), at the cost of an extra shuffle uop when you load it. Or to generate something on the fly where every element has the same value starting from a vector of all-ones bytes. But unless you're writing asm by hand, compilers will constant-propagate through intrinsics so you're at their mercy. Some of them are smart enough to use broadcast loads instead of expanding _mm512_set1_epi32(0x03020100)
into 64 bytes, but not always.
There aren't instructions which do something different to each element, and the multiply trick is limited to a width of 64-bit chunks.
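To illustrate the cheap cases above with intrinsics (hypothetical helper names; as noted, a compiler may constant-propagate these back into full 64-byte loads anyway):
#include <immintrin.h>

/* A 16-byte .rodata pattern broadcast to all four 128-bit lanes (vbroadcasti32x4). */
static inline __m512i repeat_0to15_bytes(void) {
    __m128i pat = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    return _mm512_broadcast_i32x4(pat);
}

/* "Compressed" storage for a dword constant: 16 bytes in memory,
   zero-extended to dword elements 0..15 on load (vpmovzxbd). */
static inline __m512i dwords_0to15(void) {
    __m128i pat = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    return _mm512_cvtepu8_epi32(pat);
}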
Interesting trick with 0x01010101 squared; that could be a good starting point, except you might as well start directly with mov eax, 0x03020100 / vpbroadcastd xmm0, eax (or ZMM), or vmovd xmm0, eax, or 64-bit mov rax, 0x0706050403020100 (10 bytes) / vpbroadcastq zmm0, rax (6 bytes), which are cheaper than vpternlogd zmm0,zmm0,zmm0, -1 / vpabsb zmm0, zmm0 (to get set1_epi8(1)) plus vpmullq zmm0,zmm0,zmm0 / vpsllq zmm0, zmm0, 8.
There's not even a widening 64-bit => 128-bit multiply although AVX-512 does have vpmullq
which AVX2 doesn't. However it's 2 uops on Intel CPUs. (One on Zen4).
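Spelled out with intrinsics, those two routes to the same set1_epi64(0x0706050403020100) starting point look like this (hypothetical helper names; a compiler will constant-fold both, so the comments just show what you'd write by hand in asm):
#include <immintrin.h>

static inline __m512i rep8_0to7_cheap(void) {
    /* mov rax, 0x0706050403020100 / vpbroadcastq zmm0, rax */
    return _mm512_set1_epi64(0x0706050403020100);
}

static inline __m512i rep8_0to7_multiply_trick(void) {
    __m512i ones = _mm512_abs_epi8(_mm512_set1_epi8(-1)); /* all-ones (vpternlogd) then vpabsb: set1_epi8(1) */
    __m512i sq   = _mm512_mullo_epi64(ones, ones);        /* vpmullq: 0x0807060504030201 per qword */
    return _mm512_slli_epi64(sq, 8);                      /* vpsllq by 8 bits: 0x0706050403020100 */
}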
Each AVX-512 instruction is at least 6 bytes (4-byte EVEX + opcode + modrm), so that adds up quickly if you're optimizing for pure size of .text+.rodata (which might not be unreasonable outside a loop). You still wouldn't want an actual loop that stored 4 bytes at a time for 16 iterations, like add eax, 0x04040404 / stosd; that would be slower than you want even outside a loop.
Starting with set1_epi32(0x03020100)
or a 64-bit or 128-bit version would still need multiple shuffle and add steps to widen up to 512-bit, with the right amount of 0x04, 0x08, or 0x10 added to each part of the broadcast result.
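For example, starting from the 8-byte version, the widening steps with intrinsics might look something like this (hypothetical helper name, untested; each doubling is a shuffle/insert plus an add of the right offset, and a compiler may constant-fold the whole thing anyway):
#include <immintrin.h>

static inline __m512i widen_0to63(void) {
    __m128i v8  = _mm_set_epi64x(0, 0x0706050403020100);          /* bytes 0..7 in the low qword */
    __m128i v16 = _mm_or_si128(v8,
                    _mm_slli_si128(_mm_add_epi8(v8, _mm_set1_epi8(8)), 8));   /* 0..15 */
    __m256i v32 = _mm256_inserti128_si256(_mm256_castsi128_si256(v16),
                    _mm_add_epi8(v16, _mm_set1_epi8(16)), 1);                 /* 0..31 */
    return _mm512_inserti64x4(_mm512_castsi256_si512(v32),
                    _mm256_add_epi8(v32, _mm256_set1_epi8(32)), 1);           /* 0..63 */
}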
The best I can come up with is below, and it's still not good enough to actually use. Using some AVX2 (VEX-encoded) instructions for the intermediate steps saves code size vs. using ZMM all the way, unless I'm missing a way to save an instruction.
The strategy is to create [ 0x30 repeating | 0x20 repeating | 0x10 repeating | 0x00 repeating]
in a ZMM and add it to a broadcast 16-byte pattern.
default rel
vpbroadcastd ymm1, [vec4_0x10] ; we're loading another constant anyway, this is cheaper
vpaddd ymm2, ymm1,ymm1 ; set1(0x20)
vmovdqa xmm3, xmm1 ; [ set1(0) , set1(0x10) ] ; mov-elimination
vpaddd ymm4, ymm3, ymm2 ; [ set1(0x20), set1(0x30) ]
vshufi32x4 zmm4, zmm3, zmm4, 0b00_01_00_01 ; _MM_SHUFFLE(0,1,0,1) works like shufps but in 16-byte chunks.
vbroadcasti64x2 zmm0, [vec16_0to15]
vpaddb zmm0, zmm0, zmm4 ; memory-source broadcast only available with element size, e.g. vpaddq z,z,m64{1to8} but that'd take more granular shuffling
section .rodata
align 16
vec16_0to15: db 0,1,2,3,4,5,6,7
db 8,9,10,11,12,13,14,15
vec4_0x10: dd 0x10101010
Size: machine code: 0x2c bytes. Constants: 16 + 4 = 0x14.
Total: 0x40 = 64 bytes, the same as putting the whole literal constant in memory.
Masking might have saved vector instructions, at the cost of needing to set up mask-register values, which costs a mov eax, imm32 / kmovd k1, eax.
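For reference, a rough intrinsics rendering of the same sequence (hypothetical function name, untested; a compiler will almost certainly constant-propagate it back into a plain 64-byte load, so it's mostly useful as a model of what the asm above does):
#include <immintrin.h>

__m512i make_incrementing_bytes_16B_chunks(void) {
    __m256i v10    = _mm256_set1_epi32(0x10101010);             /* vpbroadcastd ymm1 */
    __m256i v20    = _mm256_add_epi32(v10, v10);                /* vpaddd: set1(0x20) */
    __m256i v00_10 = _mm256_inserti128_si256(_mm256_setzero_si256(),
                       _mm256_castsi256_si128(v10), 0);         /* like vmovdqa xmm: [ 0 | 0x10.. ] */
    __m256i v20_30 = _mm256_add_epi32(v00_10, v20);             /* [ 0x20.. | 0x30.. ] */
    /* vshufi32x4 with imm 0x11: gather the lanes as 0x00.., 0x10.., 0x20.., 0x30.. (low to high) */
    __m512i offs = _mm512_shuffle_i32x4(_mm512_castsi256_si512(v00_10),
                                        _mm512_castsi256_si512(v20_30),
                                        _MM_SHUFFLE(0,1,0,1));
    __m512i base = _mm512_broadcast_i32x4(                      /* broadcast the 16-byte 0..15 pattern */
        _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
    return _mm512_add_epi8(base, offs);                         /* vpaddb */
}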
A less-bad tradeoff of instruction (uop) count vs. total size could be to start with a 32-byte 0..31 constant so you just need to set 1 bit in the upper half after broadcasting.
;; update: this is a better tradeoff, 61 total bytes and far fewer instructions
;; 25 bytes of machine code in 3 instructions
default rel
vmovdqa ymm0, [vec_0to31] ; 0..31
vpord ymm1, ymm0, [mask_0x20]{1to8} ; 32..63
vinserti32x8 zmm0, zmm0, ymm1, 1
section .rodata
;; 36 bytes of data
align 32
vec_0to31: db 0..31 ; see below for a way to actually write this in NASM
mask_0x20: dd 0x20202020
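The same 3-instruction idea with intrinsics, as a sketch (hypothetical function name; vinserti32x8 needs AVX-512DQ, and again a compiler will usually just constant-propagate this):
#include <immintrin.h>

__m512i make_incrementing_bytes_32B(void) {
    __m256i lo = _mm256_setr_epi8(                       /* 0..31, a 32-byte .rodata constant */
        0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
    __m256i hi = _mm256_or_si256(lo, _mm256_set1_epi32(0x20202020)); /* 32..63: set bit 5 */
    return _mm512_inserti32x8(_mm512_castsi256_si512(lo), hi, 1);    /* vinserti32x8 */
}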
So vs. just putting the full 64-byte literal in .rodata, the 16-byte-chunk way only saves about 10 bytes: the size of a ZMM load with a RIP-relative addressing mode to get it into a register. Or only 4 bytes, the size of a RIP-relative addressing mode, the difference between vpaddb zmm0, zmm0, zmm31 vs. vpaddb zmm0, zmm0, [vector_const], depending on what you're doing with it.
$ objdump -drwC -Mintel foo
0000000000401000 <_start>:
401000: c4 e2 7d 58 0d 07 10 00 00 vpbroadcastd ymm1,DWORD PTR [rip+0x1007] # 402010 <vec4_0x10>
401009: c5 f5 fe d1 vpaddd ymm2,ymm1,ymm1
40100d: c5 f9 6f d9 vmovdqa xmm3,xmm1
401011: c5 e5 fe e2 vpaddd ymm4,ymm3,ymm2
401015: 62 f3 65 48 43 e4 11 vshufi32x4 zmm4,zmm3,zmm4,0x11
40101c: 62 f2 fd 48 5a 05 da 0f 00 00 vbroadcasti64x2 zmm0,XMMWORD PTR [rip+0xfda] # 402000 <vec16_0to15>
401026: 62 f1 7d 48 fc c4 vpaddb zmm0,zmm0,zmm4
$ size foo
text data bss dec hex filename
64 0 0 64 40 foo
I did confirm this works with GDB attached to SDE:
# stopped before the last vpaddb
(gdb) p /x $zmm0.v64_int8
$2 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0,
0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf}
(gdb) p /x $zmm4.v64_int8
$3 = {0x0 <repeats 16 times>, 0x10 <repeats 16 times>, 0x20 <repeats 16 times>, 0x30 <repeats 16 times>}
(gdb) si
0x000000000040102c in ?? ()
(gdb) p /x $zmm0.v64_int8
$4 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d,
0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39,
0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f}
If you were considering doing something like this, use the version that starts with a 32-byte constant since 3 instructions is not unreasonable, and it's less total size. (Assuming you don't lose space to padding from aligning the constant, or following the 4-byte constant. You could leave it unaligned, especially if you know it doesn't cross a cache-line boundary. Extra latency from a cache-line split might stop out-of-order exec from getting started on the work that uses the constant, so that's undesirable.)
@chtz suggests another alternative in comments:
You can create a {0,...,0,8,...,16,...24,...} vector using vpmovzxbq from {0,1,2,...,7}, combined with a vpmultishiftqb with a broadcasted -3. Then add a broadcasted 0x0001020304050607 (can use the same memory as the vpmovzxbq).
I haven't tested this, but it could be interesting, especially if you want to use only immediates, no loads from .rodata. mov rax, 0x0706050403020100
/ vpbroadcastq zmm0, rax
/ vpmovzxbq zmm1, xmm0
gives you the two constants based on that. With memory sources you could use vporq
or vpaddq
with a [mem]{1to8}
instead of a separate vpbroadcastq
. Getting a -3
vector might just be mov rax, -3
/ vpbroadcastq
. Still 2 instructions, but one of them scalar integer not competing for vector execution units.
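As an untested sketch of that idea with intrinsics (hypothetical function name; vpmultishiftqb needs AVX-512VBMI, and vpaddb on ZMM needs AVX-512BW):
#include <immintrin.h>

__m512i make_incrementing_bytes_multishift(void) {
    __m512i rep8 = _mm512_set1_epi64(0x0706050403020100);              /* bytes 0..7 in every qword */
    __m512i qidx = _mm512_cvtepu8_epi64(_mm512_castsi512_si128(rep8)); /* vpmovzxbq: qwords 0..7 */
    /* vpmultishiftqb with all-bytes control -3 (bit offset 61 mod 64, wrapping):
       each byte of qword k picks up k<<3, i.e. 0,8,16,...,56 repeated across its qword */
    __m512i offs = _mm512_multishift_epi64_epi8(_mm512_set1_epi8(-3), qidx);
    return _mm512_add_epi8(offs, rep8);                                /* vpaddb: bytes 0..63 */
}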
I just don't know a way to write the constant to load in assembly that is both concise and clear. (val: .rept 64 / .byte .-val / .endr satisfies the former but not the latter, for instance.)
That's a neat use of GAS syntax (although of course ; is the statement separator if you want to actually put it all on one line). Seems like a comment on it would be sufficient.
In NASM syntax, %assign
inside %rep 64
would be the natural way, as shown in the NASM manual's example of using %rep
for unrolling a loop. In this case,
align 64
vec64_0to63: ; self-explanatory name for the constant points readers in the right direction
%assign i 0
%rep 64
db i
%assign i i+1
%endrep
Something equivalent is possible in GAS with .set
.
%xdefine
would be usable, too, although that would make the assembler eval a growing 0+1+1+1+1+...
text string every time.
Conversely, your idea in NASM syntax looks like this, where a comment and the label name remind readers how it works. I actually prefer this to the %assign
version; there's less going on to keep track of.
vec64_0to63:
%rep 64
db $ - vec64_0to63 ; value = offset from the start of the constant, i.e. 0..63
%endrep
Doing it all on one line with times doesn't work: v2: times 64 db $-v2 fills with zeros, because $ refers to the start of the line, so $-v2 evaluates to a constant zero which then gets repeated.