Tags: assembly, rust, arm64, neon, hamming-weight

Efficiently calculate hamming weight


I'm on an Apple M1 processor.

What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it via a cnt instruction (which counts 1s per 8-bit block) and then an addv to add each of the 8-bit blocks.

To start with, I figured I would feed in a 64-bit uint.

use std::arch::asm;

fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do I get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }

    return outp;
}

This nearly works, but if the number is wider than 8 bits then the result isn't quite right: 225 gives a bit count of 4 (correct), but 425 gives 260, which can't be a valid bit count.

So:

  1. I need a way to get the addv's output (one possible fix is sketched after this list)
  2. I need a way for it to work with an arbitrary char array (which would need a loop, pulling 128 bits at a time)
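
For point 1, a minimal sketch of one possible fix (assuming the single-u64 shape is kept; the function name is mine and this is not benchmarked): replace the dup with an fmov into d0 so the input is only counted once, then read the addv result in b1 back out through a w register:

use std::arch::asm;

fn aarch64_hamming_weight_fixed(inp: u64) -> u64 {
    let outp: u64;
    unsafe {
        asm!(
            "fmov d0, {input}",    // low 64 bits = input, high 64 bits zeroed
            "cnt  v0.8b, v0.8b",   // per-byte popcounts
            "addv b1, v0.8b",      // sum the 8 byte counts into b1
            "fmov {output:w}, s1", // read the scalar result (zero-extended into the x register)
            input = in(reg) inp,
            output = out(reg) outp,
            out("v0") _,
            out("v1") _,
        );
    }
    outp
}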

Solution

  • When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and RustC was going to do a good job vectorizing u8.count_ones(). That is unfortunately not the case, but you should only use asm if you're going to write a whole loop, not one u64 at a time. (Possibly still inline asm, but writing a whole function might be reasonable.)

    If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.
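
    Rust does have them: the std::arch::aarch64 module exposes NEON intrinsics such as vcntq_u8 (cnt) and vpadalq_u8 (uadalp). A rough sketch of a hand-vectorized version (my code, not from the answer and not benchmarked), widening only as far as needed, as discussed further below:

    // Sketch using std::arch::aarch64 intrinsics: vcntq_u8 is cnt, vpadalq_u8 is uadalp.
    #[cfg(target_arch = "aarch64")]
    pub fn hamming_weight_neon(bytes: &[u8]) -> u64 {
        use std::arch::aarch64::*;
        let mut total: u64 = 0;
        let mut chunks = bytes.chunks_exact(16);
        unsafe {
            // each u16 accumulator lane gains at most 16 per iteration, so flush
            // before 65535/16 = 4095 iterations to avoid overflow
            let mut acc = vdupq_n_u16(0);
            let mut pending = 0u32;
            for chunk in &mut chunks {
                let counts = vcntq_u8(vld1q_u8(chunk.as_ptr())); // per-byte popcounts
                acc = vpadalq_u8(acc, counts);                   // pairwise-accumulate bytes into u16 lanes
                pending += 1;
                if pending == 4095 {
                    total += vaddlvq_u16(acc) as u64;            // reduce the u16 lanes to a scalar
                    acc = vdupq_n_u16(0);
                    pending = 0;
                }
            }
            total += vaddlvq_u16(acc) as u64;
        }
        // scalar cleanup for the last few bytes
        total + chunks.remainder().iter().map(|b| b.count_ones() as u64).sum::<u64>()
    }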


    For counting a whole array, you don't want to reduce every cnt result to scalar separately, especially not all the way to a general-purpose integer register. Especially not for a single u64 at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.

    clang/LLVM knows how to auto-vectorize __builtin_popcount() over an array in C, so I was hoping LLVM-based RustC would do OK for AArch64. It is somewhat OK for u64, but unfortunately terrible for u8. If you can safely point a u64 slice at your data (or a reference, or however Rust does things; see the align_to sketch below), then you can get a good fraction of the benefit portably with no brittle inline asm, and in a way that future compiler versions can hopefully improve on.

    pub fn hamming_weight_arr(inp: &[u64]) -> u64 {
        let mut sum : u64 = 0;
        for v in inp {
            sum += v.count_ones() as u64;  // maybe not terrible but a lot of hsumming.
            // doing ldp q2, q3 to load 32 bytes per iteration
        }
        return sum;
    }
    

    Compiled on https://godbolt.org/z/7qP7cK6bv with -O --target aarch64-unknown-linux-gnu (with the default -C opt-level=3).

    ... some setup
    .LBB1_5:                                 // inner loop
            ldp     q2, q3, [x8, #-16]       // load pair of 16-byte vectors = unroll by 4x u64
            add     x8, x8, #32              // pointer increment by 32 bytes
            subs    x12, x12, #4
            cnt     v2.16b, v2.16b
            cnt     v3.16b, v3.16b
            uaddlp  v2.8h, v2.16b            // hsum byte pairs to u16 halfwords
            uaddlp  v3.8h, v3.16b
            uaddlp  v2.4s, v2.8h             // hsum u16 pairs to u32 words
            uaddlp  v3.4s, v3.8h
            uadalp  v0.2d, v2.4s          // sum u32 pairs into accumulator u64 vectors (doublewords)
            uadalp  v1.2d, v3.4s
            b.ne    .LBB1_5
            add     v0.2d, v1.2d, v0.2d        // combine pair of accumulators
            cmp     x10, x11
            addp    d0, v0.2d                  // and reduce to one 64-bit
            fmov    x8, d0
            b.eq    .LBB1_9
    .LBB1_7:
            add     x10, x0, x1, lsl #3
    .LBB1_8:
            ldr     d0, [x9], #8               // scalar cleanup loop, one u64 at a time
            cmp     x9, x10
            cnt     v0.8b, v0.8b
            uaddlv  h0, v0.8b                  // slow instruction, as Jake mentioned.  Or at least high latency
            fmov    w11, s0
            add     x8, x11, x8
            b.ne    .LBB1_8
    .LBB1_9:
            mov     x0, x8
            ret
    
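    If the data actually lives in a &[u8] (like the question's char array), one way to hand the aligned middle of it to hamming_weight_arr above is the align_to idiom (my sketch, not from the original answer; sound here because every bit pattern is a valid u64):

    pub fn hamming_weight_bytes(bytes: &[u8]) -> u64 {
        // align_to splits into an unaligned head, an aligned &[u64] body, and a tail
        let (head, body, tail) = unsafe { bytes.align_to::<u64>() };
        head.iter().map(|b| b.count_ones() as u64).sum::<u64>()
            + hamming_weight_arr(body)
            + tail.iter().map(|b| b.count_ones() as u64).sum::<u64>()
    }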

    Back to the auto-vectorized u64 version: you'd think sum: u32 would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, RustC still widens to 64-bit I think, but then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps a strategy built around x86 psadbw, which sums bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with pshufb and psadbw for AVX2 on x86.)
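
    That is, something like this (a sketch of the variant being described; the function name is mine):

    // u32-accumulator variant: rustc currently doesn't generate better code
    // for this than for the u64-sum version above.
    pub fn hamming_weight_arr_u32sum(inp: &[u64]) -> u32 {
        let mut sum : u32 = 0;
        for v in inp {
            sum += v.count_ones();  // count_ones() already returns u32
        }
        return sum;
    }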

    And you'd think doing the same thing for a u8 array should vectorize to the same code, with just some extra scalar cleanup, right? It should, but it still only unrolls by 4 like LLVM likes to do, except that's 4 u8 elements, and the inner loop becomes total garbage:

    // &[u8] version: the inner loop is a disaster
    
    LBB2_5:
            ldurb   w12, [x8, #-2]            // scalar zero-extending byte load
            subs    x11, x11, #4
            ldrb    w13, [x8]                 // another scalar zero-extending byte load
            fmov    d2, x12                   // copy it into vector reg
            ldurb   w12, [x8, #-1]
            fmov    d3, x13
            ldrb    w13, [x8, #1]
            add     x8, x8, #4
            mov     v2.d[1], x12              // get two more bytes into the high 64 bits of a vector
            mov     v3.d[1], x13
            cnt     v2.16b, v2.16b           // same cnt / uaddlp sequence as before
            cnt     v3.16b, v3.16b
            uaddlp  v2.8h, v2.16b
            uaddlp  v3.8h, v3.16b
            uaddlp  v2.4s, v2.8h
            uaddlp  v3.4s, v3.8h
            uadalp  v0.2d, v2.4s
            uadalp  v1.2d, v3.4s
            b.ne    .LBB2_5
    
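    The &[u8] source being compiled there is presumably just the same loop with the element type changed (my reconstruction; only the asm was shown):

    pub fn hamming_weight_arr_u8(inp: &[u8]) -> u64 {
        let mut sum : u64 = 0;
        for v in inp {
            sum += v.count_ones() as u64;  // scalar per-byte popcount, left to the auto-vectorizer
        }
        return sum;
    }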

    So it's vectorizing (v as u64).count_ones(), using a canned recipe that it doesn't know how to optimize for this case. (e.g. the uaddlp widening is pointless when the cnt results are zero except in the low byte of each 64-bit chunk; they could just use a vertical add with 64-bit (.2d) elements.)


    Compare against what you get from a C compiler for manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM Neon code is as well-tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.

    I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running for as many iterations as possible without overflowing a u16, given that each u16 element can gain up to +16 per iteration from seeing all-ones in the pair of bytes that feed into it. i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.

    Mula's version does only 31 inner-loop iterations, accumulating into 8-bit elements with vaddq_u8. But uadalp v0.8h, v2.16b horizontal-pair accumulation of u8 bytes into u16 halfword elements wasn't available for 32-bit NEON, only for AArch64 ASIMD.

    The key point is to do the minimum amount of work in the inner-most loop, ideally using only one cheap instruction per vector of cnt results. If you can get some widening for free as you accumulate (or cheap enough not to be a bottleneck), that lets execution stay in the inner loop for much longer without possible overflow. (And means the later horizontal reduction step in the outer loop is cheaper.)

    uadalp has somewhat different performance from pure-vertical add on some CPUs, but very likely worth using. The Cortex-A57 optimization guide says it has 4 (1) cycle latency, 1/clock throughput. The (1) part is the accumulate latency for the destination operand; it allows late forwarding from a previous operation of the same type, after the horizontal-pair addition of source elements has already happened. So in a loop using sum += pairs with uadalp, the loop-carried dependency chain is only 1 cycle long. This is ideal.

    Regular add on integer vectors is 3 cycle latency, 2/clock throughput, so it has better throughput but creates a 3-cycle loop-carried dependency chain. (And doesn't get a step of horizontal accumulation work done, and will overflow much sooner because it's only using 8-bit sums.)

    On Cortex-A57, cnt is also only 1/clock throughput, so uadalp's 1/clock throughput isn't an overall bottleneck, unless they compete for the same execution port. uadalp runs on F1; SIMD-integer add and cnt run on F0/F1. So either way, addition operations need to steal some cnt throughput, and the question becomes whether cnt can get scheduled efficiently onto mostly just port F0 when F1 has a bunch of future uadalp operations queued up (on data that isn't ready yet; out-of-order exec looks ahead). On most CPUs, operations are assigned to ports as they rename/allocate from the front-end into the back-end; the CPU doesn't know in what order data will become ready, but it can see which port's queue is longer at that point.

    It would be possible to do (pseudocode)

    // Probably not a good idea
    c1 = cnt(load1)
    c2 = cnt(load2)
    c1 += c2    // plain SIMD vertical add
    uadalp accumulator, c1
    

    That means only one uadalp per iteration, in case that's a throughput bottleneck, while still using uadalp (rather than a plain add) for the accumulator, which keeps the loop-carried dependency chain through the accumulator short. (Assuming other AArch64 CPUs also do early forwarding for accumulate instructions.)

    And it makes the short independent dependency chain within one iteration slightly longer (from load to input of the accumulation), instead of keeping it as two separate dep chains. Low-end ARM CPUs generally have very limited out-of-order execution capability (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations, so keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.
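
    A rough rendering of that pseudocode with std::arch::aarch64 intrinsics (my sketch, not benchmarked; vaddq_u8 is the plain vertical add, vpadalq_u8 is the uadalp accumulate, and the function name and pointer-based interface are mine):

    #[cfg(target_arch = "aarch64")]
    pub unsafe fn popcount_32byte_chunks(mut p: *const u8, mut chunks: usize) -> u64 {
        // caller must pass a pointer to `chunks * 32` readable bytes
        use std::arch::aarch64::*;
        let mut total: u64 = 0;
        while chunks != 0 {
            // each iteration adds at most 32 to a u16 lane, so flush the u16
            // accumulator before 65535/32 = 2047 iterations
            let block = chunks.min(2047);
            chunks -= block;
            let mut acc = vdupq_n_u16(0);
            for _ in 0..block {
                let c1 = vcntq_u8(vld1q_u8(p));          // cnt(load1)
                let c2 = vcntq_u8(vld1q_u8(p.add(16)));  // cnt(load2)
                let c = vaddq_u8(c1, c2);                // plain SIMD vertical add, off the loop-carried chain
                acc = vpadalq_u8(acc, c);                // single uadalp into the accumulator
                p = p.add(32);
            }
            total += vaddlvq_u16(acc) as u64;            // reduce the u16 lanes once per block
        }
        total
    }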