Looking for bit twiddling insights to optimize an algorithm to spread the bits of an 8-bit integer to the LSB of each byte of a 64-bit integer. Example:
0b10110011 -> 0x0100010100000101
The best I've come up with so far is:
fn spread(x: u8) -> u64 {
    let x = x as u64;
    // Broadcast the byte into all eight bytes, then keep bit i of byte i.
    let y = (x * 0x0101010101010101) & 0x8040201008040201;
    // Each surviving bit sits at position 8*i + i; shifting right by i moves
    // it to the LSB of its byte, and the final mask discards everything else.
    (y | (y >> 1) | (y >> 2) | (y >> 3) | (y >> 4) | (y >> 5) | (y >> 6) | (y >> 7))
        & 0x0101010101010101
}
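For what it's worth, a quick sanity check of the function (repeated here so the harness is self-contained), comparing it against a naive per-bit reference over all 256 inputs:

```rust
fn spread(x: u8) -> u64 {
    let x = x as u64;
    let y = (x * 0x0101010101010101) & 0x8040201008040201;
    (y | (y >> 1) | (y >> 2) | (y >> 3) | (y >> 4) | (y >> 5) | (y >> 6) | (y >> 7))
        & 0x0101010101010101
}

fn main() {
    // The example from the question: 0b10110011 -> 0x0100010100000101.
    assert_eq!(spread(0b10110011), 0x0100010100000101);
    // Exhaustive check against a naive per-bit construction.
    for x in 0u16..256 {
        let x = x as u8;
        let naive: u64 = (0..8).map(|i| (((x >> i) & 1) as u64) << (8 * i)).sum();
        assert_eq!(spread(x), naive);
    }
    println!("all 256 inputs match");
}
```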
This compiles to branchless but still quite long code:
movzx eax, dil
movabs rcx, 72340172838076673
imul rax, rcx
movabs rdx, -9205322385119247871
and rdx, rax
mov rsi, rdx
mov rdi, rdx
mov r8, rdx
mov r9, rdx
mov r10, rdx
mov rax, rdx
shr rax, 7
or rax, rdx
shr rdx
shr rsi, 2
or rsi, rdx
shr rdi, 3
or rdi, rsi
shr r8, 4
or r8, rdi
shr r9, 5
or r9, r8
shr r10, 6
or r10, r9
or rax, r10
and rax, rcx
ret
Clearly, the many shifts account for most of the instructions. Any clever ideas to reduce the computation needed?
SWAR? The multiplier 0x02040810204081 has its bits at multiples of 7, so bit i of x gets copies at positions i + 7k, and only the copy with k = i lands on a byte LSB (bit 8i). Splitting x into its even and odd bits first makes each partial product collision-free, so no carry can spill into a masked position:

(((x&0x55) * 0x02040810204081LL) | ((x&0xAA) * 0x02040810204081LL)) & 0x0101010101010101LL
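A sketch of that expression back in Rust (my reading: within each even/odd half, no two copies of set bits overlap, so the multiplies never carry, and the final mask keeps exactly the byte LSBs):

```rust
fn spread_swar(x: u8) -> u64 {
    let x = x as u64;
    // Bits of M sit at 0, 7, 14, ..., 49 (multiples of 7), so bit i of x is
    // copied to positions i + 7k; only k == i yields a multiple of 8,
    // i.e. the LSB of byte i.
    const M: u64 = 0x0002040810204081;
    // Even and odd bits multiplied separately: within each half the copies
    // never collide, so both products are carry-free.
    (((x & 0x55) * M) | ((x & 0xAA) * M)) & 0x0101010101010101
}

fn main() {
    // Matches the example from the question.
    assert_eq!(spread_swar(0b10110011), 0x0100010100000101);
    println!("ok");
}
```

Two multiplies, one OR, two ANDs, and a final mask — considerably shorter than the eight-shift version.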