Tags: assembly, arm, xor, neon, cortex-a8

XOR all elements/lanes of NEON vector/register (pairwise?) in assembly on ARM Cortex A8


I'm not sure what the exact nomenclature is here, but here's the question:

I'm working on a checksum, and I want to take a number of different [32-bit] values, store them in the elements of one or more NEON vectors, XOR them together, and then pass the result back to an ARM register for future computation. [The checksum has a number of different blocks based on a nonce, so I want to XOR these secondary results "into" the nonce, without losing entropy].

I'm not worried about performance (although fewer operations is always preferable, as is minimizing stalls of the ARM; the NEON can stall all it needs to), or the fact that this is not a particularly vectorizable operation; I need to use the NEON unit for this.

It would be ideal if there were some sort of horizontal XOR, wherein it would XOR the [4] elements of the vector with each other, and return the result, but that doesn't exist. I could obviously do something like (excuse the brutal pseudo-code):

load value1 s0
load value2 s2
veor d2, d0, d1
load value3 s0
load value4 s2
veor d0, d0, d1
veor d0, d0, d2

But is there a better way? I know there's pairwise addition, but seemingly no pairwise XOR. I'm flexible as far as using as many register lanes or registers as possible.

TL;DR: I need to do: res = val1 ^ val2 ^ val3 ^ val4 on the NEON, which is probably dumb, but I'm looking for the least-dumb way of doing it.

Thanks!


Solution

  • The NEON way of doing it. Unroll the loop for better performance, because as written it uses the loaded data immediately, before the load has had time to complete.

    vld1.u32 {q0},[r0]!        ; load 4 32-bit values into Q0
    veor.u32 d0,d0,d1          ; XOR 2 pairs of values (0<-2, 1<-3)
    vext.u8 d1,d0,d0,#4        ; rotate d0 by 4 bytes so its "high" value lands in element 0 of d1
    veor.u32 d0,d0,d1          ; now element 0 of d0 has all 4 values XOR'd together
    vmov.u32 r2,d0[0]          ; transfer back to an ARM register
    str r2,[r1],#4             ; store in output and advance the output pointer
    

    The ARM way of doing it. It loads the data a little slower, but avoids the delay of waiting for the transfer from NEON to ARM registers.

    ldmia r0!,{r4-r7}      ; load 4 32-bit values
    eor r4,r4,r5
    eor r4,r4,r6
    eor r4,r4,r7           ; XOR all 4 values together
    str r4,[r1],#4         ; store in output and advance the output pointer
    

    If you can count on processing multiple groups of four 32-bit values, then NEON can give you an advantage: load up a bunch of registers, then process them. If you're just calling a function that works on 4 integers, the ARM version may perform better.
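
For reference, the same reduction can be sketched in C with NEON intrinsics. This is only an illustration under stated assumptions, not part of the original answer: `xor_groups` is a hypothetical helper name, and the scalar branch is there only so the sketch also builds on targets without NEON. It stores the result straight from the NEON lane with `vst1_lane_u32`, which may let the compiler skip the NEON-to-ARM register transfer when the value is only needed in memory; what the compiler actually emits is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: out[i] = XOR of in[4*i] .. in[4*i + 3]. */
void xor_groups(const uint32_t *in, uint32_t *out, size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        uint32x4_t v = vld1q_u32(in + 4 * i);            /* load 4 values      */
        /* Fold high half onto low half: lanes become (0^2, 1^3). */
        uint32x2_t half = veor_u32(vget_low_u32(v), vget_high_u32(v));
        /* vext by one 32-bit lane swaps the two lanes; XOR folds them. */
        half = veor_u32(half, vext_u32(half, half, 1));
        vst1_lane_u32(out + i, half, 0);                 /* store lane 0       */
    }
}
#else
/* Scalar fallback (assumption: used only when NEON is unavailable). */
void xor_groups(const uint32_t *in, uint32_t *out, size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        const uint32_t *p = in + 4 * i;
        out[i] = p[0] ^ p[1] ^ p[2] ^ p[3];
    }
}
#endif
```

If the result really has to end up in an ARM register, as in the question, `vget_lane_u32(half, 0)` can be used in place of the lane store, at the cost of the register transfer discussed above.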