I want to expand some u8s out to u64s, except instead of zero- or sign-extending, which have direct support, I want "copy-extending". What's the best way to do this (on Intel CPUs with AVX-512)? Example code is in Rust, but the host language isn't the interesting part.
#![feature(portable_simd)]
use std::simd::*;
// Expands out each input byte 8 times
pub fn batch_splat_scalar(x: [u8; 16]) -> [u64; 16] {
    let mut ret = [0; 16];
    for i in 0..16 {
        ret[i] =
            u64::from_le_bytes([x[i], x[i], x[i], x[i], x[i], x[i], x[i], x[i]]);
    }
    ret
}
pub fn batch_splat_simd(x: u8x16) -> u64x16 {
    Simd::from_array(batch_splat_scalar(x.to_array()))
}
which compiles to something like this with AVX-512:
vpmovzxbq zmm0, qword ptr [rsi]
vpbroadcastq zmm1, qword ptr [rip + .LCPI0_0]
mov rax, rdi
vpmuludq zmm2, zmm0, zmm1
vpbroadcastq zmm3, qword ptr [rip + .LCPI0_1]
vpmuludq zmm0, zmm0, zmm3
vpsllq zmm0, zmm0, 32
vporq zmm0, zmm2, zmm0
vmovdqu64 zmmword ptr [rdi], zmm0
vpmovzxbq zmm0, qword ptr [rsi + 8]
vpmuludq zmm1, zmm0, zmm1
vpmuludq zmm0, zmm0, zmm3
vpsllq zmm0, zmm0, 32
vporq zmm0, zmm1, zmm0
vmovdqu64 zmmword ptr [rdi + 64], zmm0
vzeroupper
ret
So each u64 element of the result contains 8 copies of the corresponding u8 input? I think your best bet in asm is vpermb if you can use AVX-512VBMI (Ice Lake). With the right control vector, you can have each byte of a ZMM grab the byte you want from the low 16 bytes of another ZMM (i.e. an XMM).
Otherwise, broadcast and vpshufb zmm (https://www.felixcloutier.com/x86/pshufb).
One 128-bit to 512-bit broadcast load can feed two shuffles with different control vectors. Or two vpbroadcastq 64-bit broadcast loads can feed two vpshufb with the same control vector.
On Intel at least, broadcast loads are just a load uop, no ALU uop (https://uops.info/). So if you're loading data from memory anyway, do one broadcast load and use 2x vpshufb instead of 2x vpermb, since it's cheaper (lower latency, but still only one execution port).
I'm not familiar with Rust's std::simd, but the asm it emitted is very bad, using multiply bithacks (probably with constants like 0x0101010101010101) instead of shuffle instructions.
The asm you want is something like
VBROADCASTI32X4 zmm0, [rsi] # the mem operand is 128-bit, an xmmword
vpshufb zmm1, zmm0, [.LC0] # or with the vector constants in regs if reused
vpshufb zmm0, zmm0, [.LC1]
The first vector constant is 0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1 etc. The second is 8,8,8,8,8,8,8,8, 9,9,9,9,9,9,9,9 etc. Indexing is within each 128-bit lane, which is why we used a broadcast load.
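If you want to express that directly from Rust rather than hoping the autovectorizer finds it, a minimal sketch with core::arch intrinsics could look like the following (untested; the function name batch_splat_vpshufb is mine, it assumes an AVX-512BW-capable CPU, and depending on your toolchain the AVX-512 intrinsics may still be gated behind the nightly stdarch_x86_avx512 feature). The control vectors are built at runtime here only to show their values; in real code you'd want them as constants so they become memory operands or stay in registers.
use std::arch::x86_64::*;

#[target_feature(enable = "avx512f,avx512bw")]
pub unsafe fn batch_splat_vpshufb(x: [u8; 16]) -> [u64; 16] {
    // vbroadcasti32x4: one 128-bit load, copied to all four 128-bit lanes.
    let src = _mm512_broadcast_i32x4(_mm_loadu_si128(x.as_ptr() as *const __m128i));
    // In-lane shuffle controls: byte j of the low result takes input byte j/8
    // (values 0..7), byte j of the high result takes input byte j/8 + 8 (8..15).
    // This works because every 128-bit lane of src holds all 16 input bytes.
    let (mut lo, mut hi) = ([0u8; 64], [0u8; 64]);
    for j in 0..64 {
        lo[j] = (j / 8) as u8;
        hi[j] = (j / 8) as u8 + 8;
    }
    let ctrl_lo: __m512i = core::mem::transmute(lo);
    let ctrl_hi: __m512i = core::mem::transmute(hi);
    let out_lo = _mm512_shuffle_epi8(src, ctrl_lo); // result u64s 0..7
    let out_hi = _mm512_shuffle_epi8(src, ctrl_hi); // result u64s 8..15
    let lo_arr: [u64; 8] = core::mem::transmute(out_lo);
    let hi_arr: [u64; 8] = core::mem::transmute(out_hi);
    let mut ret = [0u64; 16];
    ret[..8].copy_from_slice(&lo_arr);
    ret[8..].copy_from_slice(&hi_arr);
    ret
}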
If we used vpermb we could have used a simple vmovdqu xmm0, [rsi], which saves a couple of bytes of machine-code size, but the shuffles would be higher latency (still the same throughput, though, including on Zen 4 apparently). That higher latency might make it harder for out-of-order exec to hide, lowering overall throughput.
If your data was already in the bottom of a vector reg to start with, you would prefer vpermb over an ALU shuffle to broadcast or vpmovzx it.
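A corresponding sketch of the vpermb version (again untested and the name is mine; it additionally needs AVX-512VBMI): _mm512_permutexvar_epi8 indexes across all 64 source bytes, so a plain 128-bit load is enough and the index values simply run 0..7 and 8..15.
use std::arch::x86_64::*;

#[target_feature(enable = "avx512f,avx512bw,avx512vbmi")]
pub unsafe fn batch_splat_vpermb(x: [u8; 16]) -> [u64; 16] {
    // Plain vmovdqu xmm load; the upper 48 bytes of the zmm are don't-care
    // because the indices below only reference bytes 0..15.
    let src = _mm512_castsi128_si512(_mm_loadu_si128(x.as_ptr() as *const __m128i));
    let (mut lo, mut hi) = ([0u8; 64], [0u8; 64]);
    for j in 0..64 {
        lo[j] = (j / 8) as u8;      // 0,0,..,1,1,.., ..,7,7,..
        hi[j] = (j / 8) as u8 + 8;  // 8,8,..,9,9,.., ..,15,15,..
    }
    let idx_lo: __m512i = core::mem::transmute(lo);
    let idx_hi: __m512i = core::mem::transmute(hi);
    // vpermb: lane-crossing byte shuffle, index operand first.
    let out_lo = _mm512_permutexvar_epi8(idx_lo, src);
    let out_hi = _mm512_permutexvar_epi8(idx_hi, src);
    let lo_arr: [u64; 8] = core::mem::transmute(out_lo);
    let hi_arr: [u64; 8] = core::mem::transmute(out_hi);
    let mut ret = [0u64; 16];
    ret[..8].copy_from_slice(&lo_arr);
    ret[8..].copy_from_slice(&hi_arr);
    ret
}
The index bytes happen to be the same values as the vpshufb controls above; the difference is that vpermb treats them as indices into the whole 64-byte register rather than within each 16-byte lane, which is why the plain load suffices.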
I had hoped VPMULTISHIFTQB with a 64-bit broadcast memory source operand would be even better, but apparently on Intel CPUs it can't micro-fuse into a single load+shuffle uop. So using it twice is no better than 2x vpbroadcastq loads plus 2x vpshufb, except for a small saving in machine-code size and different packing into the uop cache, which might be worse or better. uops.info measured vpmultishiftqb as 2 uops for the front-end on Ice Lake and Alder Lake (Sapphire Rapids). It could be a win on Zen 4, where it does fuse into a single uop.
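For completeness, a sketch of the vpmultishiftqb idea (untested; needs AVX-512VBMI, the function name is mine, and I believe the core::arch intrinsic is _mm512_multishift_epi64_epi8 with the control as the first argument, mirroring the C intrinsic): every byte of control qword i holds the bit offset 8*i, so each output qword extracts one byte of the broadcast source qword, repeated 8 times.
use std::arch::x86_64::*;

#[target_feature(enable = "avx512f,avx512vbmi")]
pub unsafe fn batch_splat_multishift(x: [u8; 16]) -> [u64; 16] {
    // Control: all 8 bytes of qword i are the bit offset 8*i (0, 8, .., 56),
    // so output qword i is 8 copies of byte i of the source qword.
    let mut c = [0u8; 64];
    for i in 0..8 {
        for j in 0..8 {
            c[8 * i + j] = (8 * i) as u8;
        }
    }
    let ctrl: __m512i = core::mem::transmute(c);
    // Two 64-bit broadcasts (what vpbroadcastq from memory would give you;
    // a compiler may fold these into broadcast memory operands).
    let lo_q = i64::from_le_bytes(x[..8].try_into().unwrap());
    let hi_q = i64::from_le_bytes(x[8..].try_into().unwrap());
    let out_lo = _mm512_multishift_epi64_epi8(ctrl, _mm512_set1_epi64(lo_q)); // u64s 0..7
    let out_hi = _mm512_multishift_epi64_epi8(ctrl, _mm512_set1_epi64(hi_q)); // u64s 8..15
    let lo_arr: [u64; 8] = core::mem::transmute(out_lo);
    let hi_arr: [u64; 8] = core::mem::transmute(out_hi);
    let mut ret = [0u64; 16];
    ret[..8].copy_from_slice(&lo_arr);
    ret[8..].copy_from_slice(&hi_arr);
    ret
}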