Tags: x86, x86-64, micro-optimization, avx512

4-way bytewise interleave 4x 16-byte vectors from memory, with AVX512


An avx512 vector can hold 64 int8 values. I would like to do something like the following:

  1. load 16 contiguous values from memory location a, say they are 1
  2. load 16 contiguous values from memory location b, say they are 2
  3. load 16 contiguous values from memory location c, say they are 3
  4. load 16 contiguous values from memory location d, say they are 4
  5. produce an avx512 vector with the following pattern: 123412341234...1234.

Note: the 16 values from each memory load are not actually expected to all be the same; the repeated values in the example above are just to make the interleave pattern easy to see.

I know how to functionally do this through loads then shuffles. However, I would like to know what's the most effective way to do this in terms of # of registers used and expected throughput.

Perhaps there are some weird instructions optimized for this purpose.

Thanks!


Solution

  • Since you mention throughput as a major concern, minimizing back-end uops for the shuffle port would be a good idea, and/or minimizing total front-end uops. (see this re: perf analysis). The overall bottleneck will depend on surrounding code.

    I think the best strategy is to get all the data into the right 128-bit chunks (lanes) of one vector efficiently, then fix that up with a vpshufb (_mm512_shuffle_epi8).

    Normal 128-bit lane-insert loads (vinserti128 ymm, ymm, mem, imm) need 2 uops per instruction: load and merge, but the ALU part can run on any port on Skylake-X, p015, not just the shuffle unit on port 5. (Or with the vector ALU unit on port 1 shut down because of 512-bit uops in flight, just p05). https://uops.info/ and https://agner.org/optimize/.

    Unfortunately, vinserti128 does not micro-fuse, so both uops have to go through the front-end separately¹.

    However, vbroadcasti32x4 ymm{K}, [mem] does micro-fuse (RETIRE_SLOTS: 1.0) so we can do a 1-fused-domain-uop insert via a merge-masked broadcast-load. The merge-masking does require an ALU uop, apparently able to run on p015*. (It's pretty dumb that memory-source vinserti128 can't just decode this way to 1 uop, but this does require a mask register prepared ahead of time.)

    (*: The uops.info detailed results strangely show none of the uops actually running on port 0, but a ZMM version does. If testing shows that the ymm version (with 512-bit uops in flight) actually only runs on p5, then I guess do a broadcast-load into ZMM registers with a 0x00f0 merge-mask.)
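
    With intrinsics, that merge-masked broadcast insert can be spelled roughly as in the sketch below (using the AVX512VL 256-bit form; whether a compiler actually emits vbroadcasti32x4 ymm{k1}, [mem] for it, rather than re-optimizing it into something else, is up to the compiler):

     // needs <immintrin.h>; a and b are pointers to the 16-byte sources
     __m256i v0 = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a));
     v0 = _mm256_mask_broadcast_i32x4(v0, 0xf0, _mm_loadu_si128((const __m128i *)b)); // merge [b] into the high 128 bits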

    I'd suggest something like this, if you can hoist loads of 2 shuffle-control vectors and setup of the mask register. [a] and [c] can be any addressing mode, but an indexed addressing mode like [rdi + rcx] may defeat micro-fusion of the broadcast, and make it un-laminate. (Or maybe not if it counts as a 2-operand instruction like add eax, [rdi + rcx] and thus can stay micro-fused in the back-end on Haswell/Skylake.)

    ## ahead of the loop
       mov         eax,  0xf0                   ; broadcast loads will write high 4 dwords
       kmovb       k1, eax
       vpmovzxbd   zmm6, [vpermt2d_control]     ; "compress" controls with shuffle/bcast loads
       vbroadcasti32x4   zmm7, [vpshufb_control]
    
    ## Inside the loop, the actual load+interleave
       vmovdqu     xmm0, [a]                 ; 1 uop, p23
       vmovdqu     xmm1, [c]                 ; 1 uop, p23
       vbroadcasti32x4  ymm0{k1}, [b]        ; 1 uop micro-fused, p23 + p015
        ; ZMM0 = 00... 00...  BBBBBBBBBBBBBBBB  AAAAAAAAAAAAAAAA
       vbroadcasti32x4  ymm1{k1}, [d]        ; 1 uop micro-fused, p23 + p015
    
       vpermt2d    zmm0, zmm6, zmm1          ; 1 uop, p5.  ZMM6 = shuffle control
        ;ZMM0 = DDDDCCCCBBBBAAAA  DDDDCCCCBBBBAAAA ...
       vpshufb     zmm0, zmm0, zmm7          ; 1 uop, p5.  ZMM7 = shuffle control
        ;ZMM0 = DCBADCBADCBADCBA  DCBADCBADCBADCBA ...
    

    If you want to avoid vzeroupper after the loop, you could use xmm/ymm/zmm16 and 17 or something, in which case you'd want vmovdqu32 xmm16, [a], which takes more code-size (EVEX) than a VEX-encoded vmovdqu.

    Shuffle constants:

    default rel           ; you always want this for NASM
    section .rodata
    align 16
    vpermt2d_control: db 0,4,16,20, 1,5,17,21, ...   ; vpmovzxbd load this
    vpshufb_control:  db 0,4,8,12,  1,5,9,13, ...    ; 128-bit bcast load this
     ; The top 2x 128-bit lanes of ZMM0 and ZMM1 are zero, so the vpermt2d control only needs indices 0-7 and 16-23
    ; I think this is right; edits welcome with full constants (_mm512_set... syntax is fine)
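
    For reference, here's a rough C-intrinsics translation of the whole load+interleave (an untested sketch: the function name is mine, and the control vectors are written out in full as derived from the layout comments above; it uses the ZMM broadcast form with a 0x00f0 merge-mask, as mentioned in the footnote, since there's no YMM-destination intrinsic that also zeroes the upper half):

     #include <immintrin.h>

     // Interleave bytes from 4x 16-byte sources into one ZMM (1234 pattern).
     // Needs AVX512F + AVX512BW (vpshufb zmm).
     static inline __m512i interleave4x16(const void *a, const void *b,
                                          const void *c, const void *d)
     {
         // vpermt2d control, dwords low to high: 0,4,16,20, 1,5,17,21, 2,6,18,22, 3,7,19,23
         const __m512i permd_ctrl = _mm512_set_epi32(23,19,7,3, 22,18,6,2,
                                                     21,17,5,1, 20,16,4,0);
         // vpshufb control: the same 16 bytes repeated in every 128-bit lane
         const __m512i shufb_ctrl = _mm512_broadcast_i32x4(
             _mm_setr_epi8(0,4,8,12, 1,5,9,13, 2,6,10,14, 3,7,11,15));

         // 128-bit loads, then merge-masked broadcasts as 128-bit inserts into lane 1.
         // Lanes 2 and 3 of v0/v1 are don't-care: the vpermt2d control never selects them.
         __m512i v0 = _mm512_castsi128_si512(_mm_loadu_si128((const __m128i *)a));
         __m512i v1 = _mm512_castsi128_si512(_mm_loadu_si128((const __m128i *)c));
         v0 = _mm512_mask_broadcast_i32x4(v0, 0x00f0, _mm_loadu_si128((const __m128i *)b));
         v1 = _mm512_mask_broadcast_i32x4(v1, 0x00f0, _mm_loadu_si128((const __m128i *)d));

         __m512i t = _mm512_permutex2var_epi32(v0, permd_ctrl, v1);  // vpermt2d
         return _mm512_shuffle_epi8(t, shufb_ctrl);                  // vpshufb
     }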
    

    If we were shuffling one ZMM with vpermd then vpshufb (after 3x insert, see below), I think it would be the same constant expanded 2 different ways (widen bytes to dwords, or repeat 4 times), doing the same shuffle to 16 dwords in a ZMM and then to 16 bytes in each lane. So you'd save space in .rodata.

    (You can load in any order: if you have reason to expect that 2 of the sources will be ready first (store forwarding, or cache hit more likely, or load address ready first), you could use them as the source for the vmovdqu loads. Or pair them so the merge uop can execute and make room in the RS aka scheduler sooner. I paired them this way to make the shuffle control constants more human-friendly.)

    If this isn't in a loop (so you can't hoist the constant setup) it's not worth spending 2 uops to set up k1, just use vinserti128 ymm0, ymm0, [b], 1 and same for ymm1, [d]. (2 uops each, not micro-fused, p23 + p015). Also, the vpshufb control vector can be a 64-byte memory source operand. A different strategy using vpunpcklbw / vpunpckhbw and inserts (@EOF's comment) might be worth considering if you want to avoid loading any constants, but that would be more shuffles. Or possibly vpmovzxbd loads + shift/merge?
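
    A sketch of that non-loop version with intrinsics (reusing permd_ctrl / shufb_ctrl from the sketch above; _mm256_inserti128_si256 is the AVX2 intrinsic behind vinserti128):

     // No k-register setup needed; each 128-bit insert is 2 uops (not micro-fused).
     __m256i ab = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a));
     ab = _mm256_inserti128_si256(ab, _mm_loadu_si128((const __m128i *)b), 1);
     __m256i cd = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)c));
     cd = _mm256_inserti128_si256(cd, _mm_loadu_si128((const __m128i *)d), 1);
     // Upper 256 bits of the widened vectors are don't-care; the control never selects them.
     __m512i t = _mm512_permutex2var_epi32(_mm512_castsi256_si512(ab), permd_ctrl,
                                           _mm512_castsi256_si512(cd));
     __m512i result = _mm512_shuffle_epi8(t, shufb_ctrl);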

    Perf analysis

    • total front-end cost: 6 uops. (1.5 clock cycles on SKX). Down from 8 uops / 2 cycles with vinserti128

    • total back-end cost: minimum of 2 cycles per result

      • 4 loads for p23
      • 2 p5 shuffles
      • 2 p05 merges (inserts), hopefully scheduled to p0. (Port 1's vector execution units are shut down when any 512-bit uops are in flight. It can still run stuff like imul, lea, and simple-integer stuff.)

    (Any cache misses will result in the merge uops having to replay when the data does arrive.)

    Running just this back-to-back will bottleneck on back-end throughput for ports 2/3 (loads) and 0, 5 (vector ALU). There's some room to squeeze some more uops through the front-end, like storing this somewhere and/or some loop overhead that runs on other ports. Or for less-than-perfect front-end throughput. Vector ALU work will contribute to the p0 / p5 bottleneck.

    With intrinsics, clang's shuffle optimizer might turn the masked broadcast into vinserti128, but hopefully not. And GCC probably wouldn't spot that deoptimization. You didn't say what language you were using, and mentioned registers, so I'll just use asm in the answers. Easy enough to translate to C intrinsics, maybe C# SIMD stuff, or whatever other language you're actually using. (Hand-written asm is usually not necessary or worth it in production code, especially if you want portability to other compilers.)


    It would also be possible to do one vmovdqu, vinserti128 ymm, and 2x vinserti32x4 zmm. (Or equivalent 1-uop merge-masking broadcast loads). But that would have worse ILP for merging, and we'd still need a vpermd + vpshufb because vpermb requires AVX512VBMI (Ice Lake, not Skylake-X).

    However, if you do also have AVX512VBMI, vpermb is only 1 uop on Ice Lake, so 3x insert + vpermb would be ideal for throughput. Doing the inserts with merge-broadcasts would need 2 separate merge masks, 0xf0 (use with ymm 32x4 and zmm 64x2) and 0xf000 (use with zmm 32x4, loading [d] last), or some variation on that.

    Using vpermt2b with the parallel-insert setup would be worse: Ice Lake vpermt2b costs 3 uops (p05 + 2p5).
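
    For reference, a sketch of that Ice Lake 3x-insert + vpermb path with intrinsics (untested; the masks are the 0xf0 / 0xf000 values described above, and the vpermb index table is mine, derived from the 1234 output pattern):

     // Build lanes [D C B A] in one ZMM with 3 merge-masked broadcast inserts,
     // then one vpermb does the whole byte interleave.  Needs AVX512VBMI
     // (and AVX512DQ for the 64x2 broadcast form, which Ice Lake has).
     static const unsigned char permb_idx[64] = {
          0,16,32,48,  1,17,33,49,  2,18,34,50,  3,19,35,51,
          4,20,36,52,  5,21,37,53,  6,22,38,54,  7,23,39,55,
          8,24,40,56,  9,25,41,57, 10,26,42,58, 11,27,43,59,
         12,28,44,60, 13,29,45,61, 14,30,46,62, 15,31,47,63 };

     __m512i v = _mm512_castsi128_si512(_mm_loadu_si128((const __m128i *)a));         // lane 0 = A
     v = _mm512_mask_broadcast_i32x4(v, 0x00f0, _mm_loadu_si128((const __m128i *)b)); // lane 1 = B
     v = _mm512_mask_broadcast_i64x2(v, 0xf0,   _mm_loadu_si128((const __m128i *)c)); // lanes 2,3 = C
     v = _mm512_mask_broadcast_i32x4(v, 0xf000, _mm_loadu_si128((const __m128i *)d)); // lane 3 = D (overwrites C there)
     __m512i result = _mm512_permutexvar_epi8(_mm512_loadu_si512(permb_idx), v);      // vpermb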


    The two shuffle constants can be compressed in memory to 16 bytes each: load the vpermt2d vector with vpmovzxbd to expand bytes to dwords, load the vpshufb control with VBROADCASTI64X2 zmm1, m128 to repeat the in-lane shuffle vector 4 times. It's probably worth fitting both constants into the same cache line, even though that costs a load+shuffle outside the loop.

    If you implement this with C intrinsics, just use _mm512_set_epi8/32; compilers will usually defeat your attempt to be clever by doing constant-propagation. Clang and gcc are sometimes smart enough to compress constants for you, but usually only broadcast-loading, not vpmovzx.
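
    One way to at least express the compressed form with intrinsics is shown below (a sketch; the array names are mine, and as noted a compiler is still free to constant-propagate this back into full 64-byte constants):

     // 16 bytes per control in .rodata, expanded once outside the loop
     static const unsigned char permd_ctrl_bytes[16] = { 0,4,16,20, 1,5,17,21,
                                                         2,6,18,22, 3,7,19,23 };
     static const unsigned char shufb_ctrl_bytes[16] = { 0,4,8,12, 1,5,9,13,
                                                         2,6,10,14, 3,7,11,15 };
     __m512i permd_ctrl = _mm512_cvtepu8_epi32(                      // intended: vpmovzxbd zmm, m128
         _mm_loadu_si128((const __m128i *)permd_ctrl_bytes));
     __m512i shufb_ctrl = _mm512_broadcast_i32x4(                    // intended: vbroadcasti32x4 zmm, m128
         _mm_loadu_si128((const __m128i *)shufb_ctrl_bytes));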


    Footnote 1: Agner Fog's instruction tables indicate that VINSERTI32x4 z,z,m,i can micro-fuse (1 front-end uop), but uops.info's mechanical testing results disagree: RETIRE_SLOTS: 2.0 matches UOPS_EXECUTED.THREAD: 2.0. Probably a typo in Agner's table; it's normal that memory-source instructions with an immediate don't micro-fuse.

    (Also possible that it micro-fuses in the decoders and uop cache but not in the back-end; Agner's testing for micro-fusion is I think based on the uop cache, not the issue/rename bottleneck or perf counters. RETIRE_SLOTS counts fused-domain uops in the out-of-order back-end, after possible un-lamination before/during issue/rename.)

    But anyway, VINSERTI32x4 definitely doesn't help for the issue/rename bottleneck which is more often significant in tight loops. And I doubt that it actually micro-fuses even in the decoders/uop-cache. Agner's tables unfortunately do have typos.


    Alternate strategy: vpermt2d from memory (no advantages)

    Before I came up with using a broadcast-load as a 1-uop insert, this had fewer front-end uops at the cost of more shuffles, and of doing wider loads from memory for 2 of the 4 sources. I don't think this has any advantages.

    vpermt2d ymm, ymm, [mem] can micro-fuse into 1 load+shuffle uop for the front-end, on Skylake. (uops.info result: note RETIRE_SLOTS: 1.0 vs. UOPS_EXECUTED.THREAD: 2.0)

    That would require doing 256-bit loads from 2 of the four 128-bit memory operands. That would be slower if it crosses a cache-line boundary when a 128-bit load wouldn't have. (And could fault if crossing into an unmapped page). It would also require more shuffle control vectors. It could save front-end uops vs. vinserti128, but not vs. a merge-masked vbroadcasti32x4.

    ;; Worse, don't use
    ; setup: ymm6, zmm7: vpermt2d/q shuffle controls: zmm8: vpshufb control
        vmovdqu   xmm0, [a]                   ; 1 uop p23
        vmovdqu   xmm1, [b]                   ; 1 uop p23
        vpermt2d  ymm0, ymm6, [c]             ; 1 uop micro-fused, p23 + p5.  256-bit load
        vpermt2d  ymm1, ymm6, [d]             ; 1 uop micro-fused, p23 + p5
    
       vpermt2q    zmm0, zmm7, zmm1           ; 1 uop, p5
        ;ZMM0 = DDDDCCCCBBBBAAAA  DDDDCCCCBBBBAAAA ...
       vpshufb     zmm0, zmm0, zmm8           ; 1 uop, p5
        ;ZMM0 = DCBADCBADCBADCBA  DCBADCBADCBADCBA ...
    
    • front-end cost: 6 uops
    • back-end cost: 4 uops for port 5, and 4 uops for p2/3

    It might be possible to use the same shuffle control for combining pairs and for the final ZMM vpermt2d or q. Maybe with vpermt2q for combining pairs and vpermt2d last? I haven't really thought this through, whether you could choose a ZMM shuffle vector such that the low YMM works for combining a pair of vectors with a different element size. Probably not.

    Unfortunately vpblendd ymm, ymm, [mem], imm8 doesn't micro-fuse.

    If you happen to know how any of [a..d] are aligned relative to a cache-line boundary, you could avoid cache-line splits when doing a 256-bit load that includes the data you want as the low or high 128 bits, choosing your vpermt2d shuffle control appropriately.


    Alternate strategy that mixes up the order of data, unless you have AVX512VBMI

    Would work with AVX512VBMI vpermb (Ice Lake) instead of AVX512BW vpshufb
    5 fused-domain uops, 1 vector const, 3 masks

    Avoid the vpermt2d by using different masked-broadcasts to distribute the 4 dwords of each 16-byte source chunk into separate lanes, such that every byte ends up somewhere, and each 16-byte lane of the result has data from all 4 vectors. (With vpermb, distributing across lanes is unnecessary; as described above you can just do whole-lane masking with masks like 0xf0)

    Every lane has 4 bytes of data from each of a,b,c, and d, with no duplication because every mask has a different set-bit in each nibble.

    # before the loop: setup
      ;mov      eax, 0x8421      ; A_mask.  Implicit: later merges leave these elements unmodified
      mov       eax, 0x4218      ; B_mask
      kmovw     k1, eax
      mov       eax, 0x2184      ; C_mask
      kmovw     k2, eax
      mov       eax, 0x1842      ; D_mask
      kmovw     k3, eax
      vbroadcasti32x4  zmm7, [inlane_shuffle]    ; for vpshufb
    
    
    ## Inside the loop, the actual load+interleave
      vbroadcasti32x4  zmm0, [a]
          ; ZMM0 = AAAA AAAA AAAA AAAA   (each A is a 4-byte chunk)
      vbroadcasti32x4  zmm0{k1}, [b]          ; b_mask = 0x4218
          ; ZMM0 = A3B2A1A0  A3A2B1A0 A3A2A1B0 B3A2A1A0
      vbroadcasti32x4  zmm0{k2}, [c]          ; c_mask = 0x2184
          ; ZMM0 = A3B2C1A0  A3A2B1C0 C3A2A1B0 B3C2A1A0
      vbroadcasti32x4  zmm0{k3}, [d]          ; d_mask = 0x1842
          ; ZMM0 = A3B2C1D0  D3A2B1C0 C3D2A1B0 B3C2D1A0
    
      vpshufb  zmm0, zmm0, zmm7    ; not lane-crossing >.<
    

    With a 64-byte vpshufb control vector, you could do a shuffle within each lane that produces DCBA... groups, but with data from non-corresponding source positions.

    This is probably not useful (without vpermb), but I started writing up this idea before realizing it was impossible with masked-broadcasts to get the first 4 bytes of [a] into the same lane as the first 4 bytes of [b], and so on.

    The mask setup could actually be optimized to smaller code and fewer front-end uops, but higher latency before k2 and k3 are actually ready for use. Using a k reg as a mask for a SIMD instruction that needs 16 mask bits ignores higher bits in the mask reg, so we can get the mask data into one and right shift it a couple times to produce masks with what we want in the low 16.

    mov       eax, 0x42184218
                              ; 0x8421  A_mask
    kmovd     k1, eax         ; 0x4218 in low 16 bits
    kshiftrd  k2, k1, 12      ; 0x2184 in low 16 bits   ; 4 cycle latency, port  5 only.
    kshiftrd  k3, k1, 8       ; 0x1842 in low 16
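
    The same trick in intrinsics could look roughly like this (a sketch; whether the compiler actually keeps these in k registers rather than doing the shifts on a GPR is up to it):

     __mmask32 kb = _cvtu32_mask32(0x42184218);           // low 16 bits = 0x4218 (B_mask)
     __mmask16 kc = (__mmask16)_kshiftri_mask32(kb, 12);  // 0x2184 (C_mask)
     __mmask16 kd = (__mmask16)_kshiftri_mask32(kb, 8);   // 0x1842 (D_mask)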
    

    But again, if you have vpermb then you only need 2 masks, 0xf0 and 0xf000: use the 0xf0 mask with vbroadcasti32x4 ymm{k1}, [b] and vbroadcasti64x2 zmm{k1}, [c], and 0xf000 with vbroadcasti32x4 zmm{k2}, [d].