Looking for bit twiddling insights to optimize an algorithm to spread the bits of an 8-bit integer to the LSB of each byte of a 64-bit integer. Example:
0b10110011 -> 0x0100010100000101
The best I've come up with so far is:
fn spread(x: u8) -> u64 {
    let x = x as u64;
    // Broadcast the byte into all eight bytes, then keep bit i of byte i.
    let y = (x * 0x0101010101010101) & 0x8040201008040201;
    // Each surviving bit sits at position 8*i + i; shifting right by i moves
    // it to the LSB of its byte, and the final mask discards everything else.
    (y | (y >> 1) | (y >> 2) | (y >> 3) | (y >> 4) | (y >> 5) | (y >> 6) | (y >> 7))
        & 0x0101010101010101
}
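For what it's worth, a quick sanity check of the function (repeated here so the harness is self-contained), comparing it against a naive per-bit reference over all 256 inputs:

```rust
fn spread(x: u8) -> u64 {
    let x = x as u64;
    let y = (x * 0x0101010101010101) & 0x8040201008040201;
    (y | (y >> 1) | (y >> 2) | (y >> 3) | (y >> 4) | (y >> 5) | (y >> 6) | (y >> 7))
        & 0x0101010101010101
}

fn main() {
    // The example from the question: 0b10110011 -> 0x0100010100000101.
    assert_eq!(spread(0b10110011), 0x0100010100000101);
    // Exhaustive check against a naive per-bit construction.
    for x in 0u16..256 {
        let x = x as u8;
        let naive: u64 = (0..8).map(|i| (((x >> i) & 1) as u64) << (8 * i)).sum();
        assert_eq!(spread(x), naive);
    }
    println!("all 256 inputs match");
}
```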
This compiles to branchless but still quite long code:
movzx eax, dil
movabs rcx, 72340172838076673
imul rax, rcx
movabs rdx, -9205322385119247871
and rdx, rax
mov rsi, rdx
mov rdi, rdx
mov r8, rdx
mov r9, rdx
mov r10, rdx
mov rax, rdx
shr rax, 7
or rax, rdx
shr rdx
shr rsi, 2
or rsi, rdx
shr rdi, 3
or rdi, rsi
shr r8, 4
or r8, rdi
shr r9, 5
or r9, r8
shr r10, 6
or r10, r9
or rax, r10
and rax, rcx
ret
Clearly, the many shifts account for most of the instructions. Any clever ideas to reduce the computation needed?
SWAR? The multiplier 0x02040810204081 has its bits at multiples of 7, so bit i of x gets copies at positions i + 7k, and only the copy with k = i lands on a byte LSB (bit 8i). Splitting x into its even and odd bits first makes each partial product collision-free, so no carry can spill into a masked position:

(((x&0x55) * 0x02040810204081LL) | ((x&0xAA) * 0x02040810204081LL)) & 0x0101010101010101LL
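A sketch of that expression back in Rust (my reading: within each even/odd half, no two copies of set bits overlap, so the multiplies never carry, and the final mask keeps exactly the byte LSBs):

```rust
fn spread_swar(x: u8) -> u64 {
    let x = x as u64;
    // Bits of M sit at 0, 7, 14, ..., 49 (multiples of 7), so bit i of x is
    // copied to positions i + 7k; only k == i yields a multiple of 8,
    // i.e. the LSB of byte i.
    const M: u64 = 0x0002040810204081;
    // Even and odd bits multiplied separately: within each half the copies
    // never collide, so both products are carry-free.
    (((x & 0x55) * M) | ((x & 0xAA) * M)) & 0x0101010101010101
}

fn main() {
    // Matches the example from the question.
    assert_eq!(spread_swar(0b10110011), 0x0100010100000101);
    println!("ok");
}
```

Two multiplies, one OR, two ANDs, and a final mask — considerably shorter than the eight-shift version.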