XOR all elements/lanes of NEON vector/register (pairwise?) in assembly on ARM Cortex A8

I'm not sure what the exact nomenclature is here, but here's the question:

I'm working on a checksum, and I want to take a number of different [32 bit] values, store them in the elements of one or more NEON vectors, XOR them together, and then pass the results back to an ARM register for future computation. [The checksum has a number of different blocks based on a nonce, so I want to XOR these secondary results "into" the nonce, without losing entropy].

I'm not worried about performance (although fewer operations are always preferable, as is minimizing stalls of the ARM; the NEON can stall all it needs to), or the fact that this is not a particularly vectorizable operation; I need to use the NEON unit for this.

It would be ideal if there were some sort of horizontal XOR, wherein it would XOR the [4] elements of the vector with each other, and return the result, but that doesn't exist. I could obviously do something like (excuse the brutal pseudo-code):

load value1 s0
load value2 s2
veor d2, d0, d1
load value3 s0
load value4 s2
veor d0, d0, d1
veor d0, d0, d2

But is there a better way? I know there's pairwise addition, but seemingly no pairwise XOR. I'm flexible as far as using as many register lanes or registers as possible.

TL;DR: I need to do: res = val1 ^ val2 ^ val3 ^ val4 on the NEON, which is probably dumb, but I'm looking for the least-dumb way of doing it.

Thanks!

Solution

The NEON way of doing it. Unroll the loop for better performance: as written, the code uses the loaded data immediately, before the load has had time to complete.

vld1.u32 {q0},[r0]!        ; load 4 32-bit values into Q0
veor.u32 d0,d0,d1          ; XOR 2 pairs of values (0<-2, 1<-3)
vext.u8 d1,d0,d0,#4        ; shift down "high" value of d0
veor.u32 d0,d0,d1          ; now element 0 of d0 has all 4 values XOR'd together
vmov.u32 r2,d0[0]          ; transfer back to an ARM register
str r2,[r1],#4             ; store in output, then advance the output pointer
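If you are working from C rather than hand-written assembly, the same fold can be sketched with NEON intrinsics. This is only an illustration and not part of the answer's code: the function name `xor4` is made up for this example, and the scalar fallback is there only so that the snippet also compiles on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: XOR the four 32-bit words at p into one word. */
static uint32_t xor4(const uint32_t *p)
{
    assert(p != NULL);
    uint32x4_t q = vld1q_u32(p);                  /* like vld1 {q0},[r0]          */
    uint32x2_t d = veor_u32(vget_low_u32(q),      /* like veor d0,d0,d1:          */
                            vget_high_u32(q));    /*   lanes are {0^2, 1^3}       */
    d = veor_u32(d, vext_u32(d, d, 1));           /* rotate lanes, then XOR again */
    return vget_lane_u32(d, 0);                   /* like vmov r2,d0[0]           */
}
#else
/* Scalar fallback that folds in the same two steps. */
static uint32_t xor4(const uint32_t *p)
{
    assert(p != NULL);
    uint32_t lane0 = p[0] ^ p[2];
    uint32_t lane1 = p[1] ^ p[3];
    return lane0 ^ lane1;
}
#endif
```

The compiler is free to schedule these differently from the listing above, so treat it as a description of the data flow rather than of the exact instructions emitted.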

The ARM way of doing it. It loads the data a little more slowly, but avoids the delay of waiting for the transfer from NEON to ARM registers.

ldmia r0!,{r4-r7}      ; load 4 32-bit values
eor r4,r4,r5
eor r4,r4,r6
eor r4,r4,r7           ; XOR all 4 values together
str r4,[r1],#4         ; store in output, then advance the output pointer

If you can count on processing multiple groups of four 32-bit values, then NEON can give you an advantage by loading up a bunch of registers, then processing them. If you're just calling a function which will work on 4 integers, then performance of the ARM version may be better.
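As one possible illustration of the multi-group case, here is a sketch in C with NEON intrinsics that handles four groups per iteration. It uses a different technique from the listings above: a de-interleaving load (`vld4q_u32`) puts word 0 of each group in one vector, word 1 in the next, and so on, so three ordinary vertical XORs produce four results at once and no horizontal step is needed. The function name `xor_groups` and its signature are assumptions made for this example, and the scalar loop covers leftover groups as well as targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper:
 * out[g] = in[4g] ^ in[4g+1] ^ in[4g+2] ^ in[4g+3] for g in [0, n_groups). */
static void xor_groups(const uint32_t *in, uint32_t *out, size_t n_groups)
{
    assert(in != NULL && out != NULL);
    size_t g = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four groups (16 words) per iteration. */
    for (; g + 4 <= n_groups; g += 4) {
        /* v.val[i] lane j holds in[4*(g+j) + i], i.e. word i of group g+j. */
        uint32x4x4_t v = vld4q_u32(in + 4 * g);
        uint32x4_t r = veorq_u32(veorq_u32(v.val[0], v.val[1]),
                                 veorq_u32(v.val[2], v.val[3]));
        vst1q_u32(out + g, r);   /* lane j is the XOR of group g+j */
    }
#endif

    /* Remaining groups, or all groups when NEON is unavailable. */
    for (; g < n_groups; g++) {
        const uint32_t *p = in + 4 * g;
        out[g] = p[0] ^ p[1] ^ p[2] ^ p[3];
    }
}
```

Because the results are stored straight from NEON to memory, this variant never moves a value from a NEON register to an ARM register inside the loop.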