
How can I optimize this simple multi-valued SIMD splat/broadcast?


I want to expand some u8s out to u64, except instead of zero- or sign-extending, which have direct support, I want "copy-extending": each output u64 holds eight copies of the corresponding input byte. What's the best way to do this (on Intel CPUs with AVX-512)? The example code is in Rust, but the host language isn't the interesting part.

#![feature(portable_simd)]

use std::simd::*;

// Expands out each input byte 8 times
pub fn batch_splat_scalar(x: [u8; 16]) -> [u64; 16] {
  let mut ret = [0; 16];
  for i in 0..16 {
    ret[i] =
      u64::from_le_bytes([x[i], x[i], x[i], x[i], x[i], x[i], x[i], x[i]]);
  }
  ret
}

pub fn batch_splat_simd(x: u8x16) -> u64x16 {
  Simd::from_array(batch_splat_scalar(x.to_array()))
}

which compiles to something like this with AVX-512:

        vpmovzxbq       zmm0, qword ptr [rsi]
        vpbroadcastq    zmm1, qword ptr [rip + .LCPI0_0]
        mov     rax, rdi
        vpmuludq        zmm2, zmm0, zmm1
        vpbroadcastq    zmm3, qword ptr [rip + .LCPI0_1]
        vpmuludq        zmm0, zmm0, zmm3
        vpsllq  zmm0, zmm0, 32
        vporq   zmm0, zmm2, zmm0
        vmovdqu64       zmmword ptr [rdi], zmm0
        vpmovzxbq       zmm0, qword ptr [rsi + 8]
        vpmuludq        zmm1, zmm0, zmm1
        vpmuludq        zmm0, zmm0, zmm3
        vpsllq  zmm0, zmm0, 32
        vporq   zmm0, zmm1, zmm0
        vmovdqu64       zmmword ptr [rdi + 64], zmm0
        vzeroupper
        ret

https://godbolt.org/z/67cW5GnKf


Solution

  • So each u64 element of the result contains 8 copies of the corresponding u8 input? I think your best bet in asm is vpermb if you can use AVX-512VBMI (Ice Lake). With the right control vector, you can have each byte of a ZMM grab the byte you want from the low 16 bytes of another ZMM (i.e. an XMM).

    Otherwise broadcast and vpshufb zmm. (https://www.felixcloutier.com/x86/pshufb)

    One 128-bit to 512-bit broadcast load can feed two shuffles with different control vectors. Or two vpbroadcastq 64-bit broadcast loads can feed two vpshufb with the same control vector.

    On Intel at least, broadcast loads are just a load uop, no ALU (https://uops.info/). So if you're loading the data from memory anyway, do one broadcast load and use 2x vpshufb instead of 2x vpermb, since vpshufb is cheaper (lower latency, though still only one execution port).


    I'm not familiar with Rust's std::simd, but the asm it emitted is very bad, using multiply bithacks (probably with constants like 0x0101010101010101) instead of shuffle instructions.

    The asm you want is something like:

       vbroadcasti32x4  zmm0, [rsi]         # the mem operand is 128-bit, an xmmword
       vpshufb          zmm1, zmm0, [.LC0]  # or with the vector constants in regs if reused
       vpshufb          zmm0, zmm0, [.LC1]
    

    The first vector constant is 0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1 etc. The second is 8,8,8,8,8,8,8,8, 9,9,9,9,9,9,9,9 etc. Indexing is within each 128-bit lane, which is why we used a broadcast load.
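
    Expressed with the AVX-512 intrinsics in Rust's std::arch instead of portable std::simd, that approach looks something like the sketch below. This is a minimal untested sketch: the batch_splat_shuffle name and the splat_ctrl helper are made up here, and it assumes AVX512F+AVX512BW and a toolchain where the 512-bit intrinsics are available (older nightlies gated them behind a stdarch feature).

        use std::arch::x86_64::*;

        // vpshufb control vectors, built at compile time:
        // splat_ctrl(0) = 0,0,0,0,0,0,0,0, 1,1,... up through 7;
        // splat_ctrl(8) = 8,8,... up through 15.
        const fn splat_ctrl(base: u8) -> [u8; 64] {
            let mut c = [0u8; 64];
            let mut j = 0;
            while j < 64 {
                c[j] = base + (j / 8) as u8;
                j += 1;
            }
            c
        }
        const LO_CTRL: [u8; 64] = splat_ctrl(0);
        const HI_CTRL: [u8; 64] = splat_ctrl(8);

        #[target_feature(enable = "avx512f,avx512bw")]
        pub unsafe fn batch_splat_shuffle(x: &[u8; 16]) -> [u64; 16] {
            // One 128->512 broadcast load (vbroadcasti32x4) feeds both shuffles.
            let v = _mm512_broadcast_i32x4(_mm_loadu_si128(x.as_ptr().cast()));
            // vpshufb indexes within each 128-bit lane; every lane holds the same
            // 16 input bytes, so lane-relative index i always picks input byte i.
            let lo = _mm512_shuffle_epi8(v, core::mem::transmute(LO_CTRL));
            let hi = _mm512_shuffle_epi8(v, core::mem::transmute(HI_CTRL));
            let mut out = [0u64; 16];
            _mm512_storeu_si512(out.as_mut_ptr().cast(), lo);
            _mm512_storeu_si512(out.as_mut_ptr().add(8).cast(), hi);
            out
        }

    This should compile to essentially the three-instruction sequence above plus the stores, with the control vectors coming from .rodata.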

    If we used vpermb we could have used a simple vmovdqu xmm0, [rsi], which saves a couple of bytes of machine-code size, but the shuffles would have higher latency (though still the same throughput, including on Zen 4 apparently). That higher latency might make it harder for out-of-order exec to hide, lowering overall throughput.

    If your data was already in the bottom of a vector reg to start with, you would prefer vpermb over spending an ALU shuffle to broadcast (or vpmovzx) it first.
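
    A hypothetical sketch of that vpermb version, reusing LO_CTRL/HI_CTRL from above; it assumes AVX-512VBMI, and the function name is made up here. vpermb indices address the full 512 bits rather than per-lane, so the 0..15 index values pick from the low 16 bytes directly, with no broadcast first:

        // Input already in a vector register; returns the two result halves.
        #[target_feature(enable = "avx512f,avx512vbmi")]
        pub unsafe fn batch_splat_permb(x: __m128i) -> (__m512i, __m512i) {
            // Zero the upper 48 bytes; vpermb only reads bytes 0..15 here anyway.
            let v = _mm512_zextsi128_si512(x);
            // Note the operand order: index vector first, data second.
            let lo = _mm512_permutexvar_epi8(core::mem::transmute(LO_CTRL), v);
            let hi = _mm512_permutexvar_epi8(core::mem::transmute(HI_CTRL), v);
            (lo, hi)
        }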


    I had hoped VPMULTISHIFTQB with a 64-bit broadcast memory source operand would be even better, but apparently on Intel CPUs it can't micro-fuse into a single load+shuffle uop: uops.info measured vpmultishiftqb with a memory operand at 2 front-end uops on Ice Lake and on Alder Lake (same core as Sapphire Rapids). So using it twice is no better than 2x vpbroadcastq loads plus 2x vpshufb, except for a small saving in machine-code size and different packing into the uop cache, which might be better or worse. It could be a win on Zen 4, where it does fuse into a single uop.
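
    For reference, here is roughly what the multishift version would look like. Again a sketch, not tested code: the batch_splat_multishift name and the control constant are mine, it assumes AVX-512VBMI, and whether the compiler turns each set1-of-a-load into an actual vpbroadcastq memory-source broadcast is up to it:

        // vpmultishiftqb control: all 8 selector bytes of qword q hold bit offset
        // 8*q, so output qword q = bits [8q, 8q+7] of the source = source byte q.
        const MS_CTRL: [u8; 64] = {
            let mut c = [0u8; 64];
            let mut j = 0;
            while j < 64 {
                c[j] = ((j / 8) * 8) as u8;
                j += 1;
            }
            c
        };

        #[target_feature(enable = "avx512f,avx512vbmi")]
        pub unsafe fn batch_splat_multishift(x: &[u8; 16]) -> [u64; 16] {
            let ctrl: __m512i = core::mem::transmute(MS_CTRL);
            // Two 64-bit broadcasts (ideally vpbroadcastq loads) share one control.
            let lo8 = _mm512_set1_epi64(i64::from_le_bytes(x[..8].try_into().unwrap()));
            let hi8 = _mm512_set1_epi64(i64::from_le_bytes(x[8..].try_into().unwrap()));
            let lo = _mm512_multishift_epi64_epi8(ctrl, lo8);
            let hi = _mm512_multishift_epi64_epi8(ctrl, hi8);
            let mut out = [0u64; 16];
            _mm512_storeu_si512(out.as_mut_ptr().cast(), lo);
            _mm512_storeu_si512(out.as_mut_ptr().add(8).cast(), hi);
            out
        }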