assembly, bit-manipulation, x86-64, micro-optimization

x86-64 instruction to AND until zero?


I know there's an instruction prefix that repeats (such as repnz).
I have an array of 8-bit values like [7, 2, 3, 4, ..., 7, 0, 0, 0 ...] (with 64 bytes of zero at the end). I'd like to load the first value (7) and AND it with the rest until the result becomes zero:

7 & 2 == 2
2 & 3 == 2
2 & 4 == 0

So I'd like to stop here and have a pointer or index of 3 (array[3] == 4). I also need the value of the result just before it became zero.

Is there a rep instruction or SIMD instruction I can use to find this? Is there any clever C code I can write? At the moment I use a simple while loop, and I end up with both the index that caused the result to become zero and a 'previousResult', so I can use the value from immediately before the result became zero (2 in the above example).
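
For reference, a minimal sketch of the while-loop approach described above (names are illustrative; it relies on the guaranteed trailing zero bytes so the loop always stops):

    #include <stddef.h>
    #include <stdint.h>

    // Returns the index of the element that made the running AND zero, and
    // writes the running AND from just before that into *previousResult.
    static size_t and_until_zero(const uint8_t *array, unsigned *previousResult)
    {
        unsigned result = array[0];
        size_t i = 1;
        while ((result & array[i]) != 0) {  // stops at the guaranteed zero, if not sooner
            result &= array[i];
            i++;
        }
        *previousResult = result;           // 2 in the example above
        return i;                           // 3 in the example above
    }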


Solution

  • If long runs are common, you could do some shifts and ANDs with the same pattern as a prefix-sum, except with AND instead of addition. This has extra setup and cleanup overhead vs. a scalar loop, but scales well with larger sizes and avoids branch mispredicts.

    Replace _mm_add_epi8 with _mm_and_si128. Find the first zero in the prefix-AND-scan with pcmpeqb against a zeroed vector, then the usual pmovmskb/bsf (or tzcnt) if there's a zero in the first 16 bytes.
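
    To make that concrete, here's a hedged SSE2 sketch (my code, not from the answer or the linked Q&As): a log-step prefix-AND over 16 bytes, ORing 0xFF into the bytes that the byte-shift fills with zeros so they don't clobber the AND, then pcmpeqb / pmovmskb / tzcnt to locate the first zero:

        #include <immintrin.h>
        #include <stdint.h>

        // Prefix-AND scan: scan[i] = a[0] & a[1] & ... & a[i] for 16 bytes.
        // _mm_slli_si128 shifts in zero bytes, which would zero the AND, so OR
        // each shifted copy with 0xFF in the bytes that were shifted in.
        static __m128i prefix_and_16(__m128i v)
        {
            const __m128i ones = _mm_set1_epi8((char)0xFF);
            v = _mm_and_si128(v, _mm_or_si128(_mm_slli_si128(v, 1), _mm_srli_si128(ones, 15)));
            v = _mm_and_si128(v, _mm_or_si128(_mm_slli_si128(v, 2), _mm_srli_si128(ones, 14)));
            v = _mm_and_si128(v, _mm_or_si128(_mm_slli_si128(v, 4), _mm_srli_si128(ones, 12)));
            v = _mm_and_si128(v, _mm_or_si128(_mm_slli_si128(v, 8), _mm_srli_si128(ones, 8)));
            return v;
        }

        // Index of the first zero in the prefix-AND of a[0..15], or 16 if none.
        static int first_zero_in_prefix_and(const uint8_t *a)
        {
            __m128i scan = prefix_and_16(_mm_loadu_si128((const __m128i *)a));
            unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(scan, _mm_setzero_si128()));
            return mask ? __builtin_ctz(mask) : 16;   // GCC/Clang builtin; bsf/tzcnt equivalent
        }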


    AVX-512 has some interesting instructions. vpternlogd can do a 3-input AND, but only vertically not within one vector, so you'd have to shuffle to set up for it. Perhaps could be useful in a prefix-AND-scan.
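
    For example (my illustration, assuming AVX-512F is available), the vertical 3-input AND is just vpternlogd with truth-table immediate 0x80, which is 1 only when all three input bits are 1:

        #include <immintrin.h>

        // vpternlogd with imm8 = 0x80: result bit is 1 only when the corresponding
        // bits of a, b and c are all 1, i.e. a & b & c (vertically, per element).
        static __m512i and3_vertical(__m512i a, __m512i b, __m512i c)
        {
            return _mm512_ternarylogic_epi32(a, b, c, 0x80);
        }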

    Other AVX-512 SIMD instructions probably aren't useful. vp2intersectd (https://felixcloutier.com/x86/vp2intersectd:vp2intersectq - Tiger Lake only) looks for pairwise equality between elements, not bitwise intersections. Similarly, vpconflictd just looks for exact matches of whole integers. GFNI instructions like VGF2P8AFFINEQB can do some neat stuff, but IDK if the right constants could get you a bitwise AND across bytes. Probably not.


    I don't have exact numbers, but length-wise 1-5 is very popular; 8 and 12 are not popular but not unusual. I suspect a SIMD load + a few shuffles + getting the index of the 0 would probably be more work than a byte-at-a-time loop when the data is length 5 or less, which is almost all the time. So I guess I'm stuck, or have to find a new way to represent the data.

    Yeah, for short lengths, SIMD would likely be slower as long as branches are correctly predicted in the short version. Perhaps peeling the first 3 or 4 iterations and using some cmov could mitigate branch mispredictions. (For throughput, if this thread is sharing a core with hyperthreading, a branch miss isn't so bad, although it still wastes some front-end bandwidth before being detected.)
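
    As a hedged sketch (my code, not tested for performance), the peeling could look something like this in C; whether the compiler actually turns the selects into cmov isn't guaranteed:

        #include <stdint.h>

        // Compute the first few steps of the running AND unconditionally, then pick
        // the first zero with branch-free selects. Relies on the trailing zero bytes
        // making a[1..3] safe to read. Returns the index that zeroed the result
        // (or 4 if none of these steps did), plus the value from just before.
        static int peel4(const uint8_t *a, unsigned *prev)
        {
            unsigned x0 = a[0];
            unsigned x1 = x0 & a[1];
            unsigned x2 = x1 & a[2];
            unsigned x3 = x2 & a[3];
            int idx    = x1 == 0 ? 1  : x2 == 0 ? 2  : x3 == 0 ? 3  : 4;
            unsigned p = x1 == 0 ? x0 : x2 == 0 ? x1 : x3 == 0 ? x2 : x3;
            *prev = p;
            return idx;   // caller continues with a loop from index 4 if idx == 4
        }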

    Perhaps some integer SWAR could create some ILP (instruction-level parallelism), like first checking x & a[1] with test al, dl / jnz, in parallel with doing a & (a>>8) on 64 bits at a time? Then and al, dl or something, where dl holds a[1] & a[2]. Unrolling by two so that one branch is not-taken every iteration could be good.

    It's usually not worth hand-optimizing the asm beyond what you can get a compiler to do, but using x86 partial registers to do byte-sized ANDs on the low byte of wider shifts might be hard to get compilers to emit. But actually we don't need that: as long as we're working with a value zero-extended into an unsigned or uint64_t, the AND will keep the higher bytes zero, so at worst a silly compiler will waste some REX prefixes where it could have used 32-bit operand-size. (In C, memcpy to a uint64_t for unaligned, aliasing-safe loads.)
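
    Putting the last two paragraphs together, here's a hedged C sketch of that SWAR idea (my code): memcpy an 8-byte chunk, form a[j] & a[j+1] for every byte position with next & (next >> 8), and keep x zero-extended in an unsigned so plain 32-bit ANDs do the work:

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        static uint64_t load_u64(const uint8_t *p)
        {
            uint64_t v;
            memcpy(&v, p, sizeof v);                    // aliasing-safe unaligned load
            return v;
        }

        // Returns the index where the running AND first becomes zero; *prev gets the
        // running AND from just before that. Relies on the guaranteed trailing zero
        // bytes both for termination and to make the 8-byte over-read safe.
        static size_t and_until_zero_swar(const uint8_t *a, unsigned *prev)
        {
            unsigned x = a[0];
            size_t i = 1;
            for (;;) {
                uint64_t next   = load_u64(a + i);      // a[i..i+7]
                uint64_t paired = next & (next >> 8);   // byte k = a[i+k] & a[i+k+1]
                for (int k = 0; k < 8; k += 2) {
                    unsigned a1 = (unsigned)((next   >> (8 * k)) & 0xFF);  // a[i+k]
                    unsigned p2 = (unsigned)((paired >> (8 * k)) & 0xFF);  // a[i+k] & a[i+k+1]
                    if ((x & a1) == 0) { *prev = x; return i + k; }
                    unsigned old = x;
                    x &= p2;                            // consume two elements at once
                    if (x == 0) { *prev = old & a1; return i + k + 1; }
                }
                i += 8;
            }
        }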

    Just for fun, I took a stab at writing some asm before https://uica.uops.info/ reminded me that shifts compete with branches for ports 0 and 6 on Intel CPUs. So resource conflicts are just as bad a problem for critical-path latency as a longer dependency chain, and it's tough to get this to execute any better than 8 cycles per iteration, or about 1 cycle per byte checked, on Ice Lake. That's about what you'd get from a simple memory-source and al, [rdi] / jz loop.

    Left-shifting eax / rax to line up x (a byte in RAX, not necessarily the lowest) with different positions of RCX (a[i+1..8]) and RSI (a[i+1..8] & a[i+2..9]) trades 2 shifts for 1 (plus 1 extra shift at the end to get the byte in RAX back down to the bottom). It also increases critical-path latency, with both a shift and an AND in the loop-carried dependency chain for RAX, vs. shifting independent load results.

    Perhaps there's other work we could be doing in parallel with a 64-bit AND, like horizontal-ANDing the last 2 pairs of bytes? But getting a 0xFF into RAX and then extracting that higher byte as an early-out would take extra instructions, just to maybe be able to skip the last 4 bytes?

    prefix_AND_scan_zero:   ; const uint8_t a[]
       movzx eax, byte [rdi]     ; x = a[0]
    
    .outer_loop:
       mov   rcx, [rdi+1]
       mov   rsi, [rdi+2]        ; rorx  rsi, rcx, 8   ; last byte would be less useful.
       and   rsi, rcx            ; a1 &= a2,  a3 &= a4, etc.
    
    ;.inner_loop:   ; fully unrolled
       test  eax, ecx           ; x & a[i+1]
       jz   .first_element      ; RDI+1 is the pointer, AL holds the value
       mov   edx, eax           ; save old x before potentially zeroing it
       and   eax, esi           ; x &= (a[i+1] & a[i+2])
       jz   .second_element
    
       shr  rcx, 16            ; shift both the other things instead of rax,
       shr  rsi, 16            ;   keeping the critical path shorter
    
       test  eax, ecx          ; x & a[i+3]
       jz   .third_element      ; RDI+3 is the pointer, AL holds the value
       mov   edx, eax          ; save old x before potentially zeroing it
       and   eax, esi          ; x &= (a[i+3] & a[i+4])
       jz   .fourth_element
    
       add  rdi, 4
       shr  rcx, 16            ; a[i+1..8] >> 32
       shr  rsi, 16
      ;   shl  eax, 16            ; x now lives in the 3rd byte of RAX
                                 ; saves front-end bandwidth but lengthens critical path and requires separate branch targets to sort out where the value is.
    
       test  eax, ecx          ; x & a[i+1]
       jz   .first_element      ; RDI+1 is the pointer, AL holds the value
       mov   edx, eax          ; save old x before potentially zeroing it
       and   eax, esi          ; x &= (a[i+1] & a[i+2])
       jz   .second_element
    
       shr  ecx, 16            ; a[i+1..8] >> 48
       shr  esi, 16
    
       test  eax, ecx          ; x & a[i+3]
       jz   .third_element      ; RDI+3 is the pointer, AL holds the value
       mov   edx, eax          ; save old x before potentially zeroing it
       and   eax, esi          ; x &= (a[i+3] & a[i+4])
       jz   .fourth_element
    
       add   rdi, 4
           ;   shr   eax, 16     ; x back to the bottom of RAX if we were shifting it
    
       jmp  .outer_loop        ; trailing zeros in the array will make the loop exit without needing a counter
    
     .fourth_element:
         add  rdi, 2             ; +2 here, +2 more in .second_element -> RDI+4
     .second_element:
         ;; RDI+2 is the pointer, value is DL & [rdi+1] (the discarded result of the last TEST)
         ;; recomputing it here keeps the loop's critical path latency short
         movzx  eax, byte [rdi+1]
         and    eax, edx
         add    rdi, 2
         ret
    
     .third_element:
         add  rdi, 2             ; +2 here, +1 more in .first_element -> RDI+3
     .first_element:     ; RDI+1 is the pointer, AL holds the value
         inc  rdi
         ret
    

    (untested, and probably not optimal especially on Intel; too many shifts and branches competing for ports 0 and 6.)

    No outer-loop condition is needed as long as there's a guaranteed zero in the input array; we will definitely stop there, if not sooner, so no loop counter is needed, just pointer updates to track progress.

    uiCA says Ice Lake / Rocket Lake could run this at 7 cycles per iteration (8 bytes checked), just barely faster than one AND per cycle. On Skylake it runs into the JCC erratum.

    It's quite finicky on Ice Lake, at least in uiCA's simulation of the pipeline (assuming perfect branch prediction etc.) A version using some shl rax, 16 at some points instead of shr rcx,16 / shr rsi,16 is predicted to run at about 8c / iter on Ice Lake. With some tweaks (making two instructions longer with a REX prefix) that avoid the JCC erratum on Skylake, it slows down to 8.85c on ICL, but speeds up to about 10c on Skylake. (uiCA, with cleanup not adjusted to get the right byte out of RAX).

    But this approach is likely not good; it's perhaps better to work in chunks of 2 or 4 bytes, so more loads and less shifting. Perhaps even 2-byte loads into a register, then test al, cl / and cl, ch / and al, cl or something. Reading CH has an extra cycle of latency on Intel, but doesn't cost throughput.