For academic purposes I want to try to write an ARM NEON optimization of the following algorithm, even if only to test whether any performance improvement is possible. I suspect this is not a good candidate for SIMD optimization, because the results are merged together, losing the parallelization gains.
This is the algorithm:
const uchar* center = ...;
int t0, t1, val;
t0 = center[0]; t1 = center[1];
val = t0 < t1;
t0 = center[2]; t1 = center[3];
val |= (t0 < t1) << 1;
t0 = center[4]; t1 = center[5];
val |= (t0 < t1) << 2;
t0 = center[6]; t1 = center[7];
val |= (t0 < t1) << 3;
t0 = center[8]; t1 = center[9];
val |= (t0 < t1) << 4;
t0 = center[10]; t1 = center[11];
val |= (t0 < t1) << 5;
t0 = center[12]; t1 = center[13];
val |= (t0 < t1) << 6;
t0 = center[14]; t1 = center[15];
val |= (t0 < t1) << 7;
d[i] = (uchar)val;
This is what I had in mind in ARM assembly:
VLD2.8 {d0, d1}, ["center" addr]
Assuming 8-bit chars, this first operation should load all the t0 and t1 values, de-interleaved, into two registers.
VCLT.U8 d2, d0, d1
a single "less than" operation for all the comparisons. NOTES: I've read that VCLT is only available with a #0 constant as the second operand, so this has to be rewritten as a VCGT with the operands swapped. Reading the ARM documentation, I think the result in every 8-bit lane will be all ones (11111111) for true or all zeros (00000000) for false.
VSHR.U8 d4, d2, #7
this right shift reduces each 8-bit lane to 0 or 1 by shifting out its lower 7 bits (mainly to delete 7 of the 8 ones in a true lane). I've used d4 because the next step needs the first D register mapped onto q2.
Now problems begin: shifting and ORs.
VSHLL.U8 q2[1], d4[1], 1
VSHLL.U8 q2[2], d4[2], 2
...
VSHLL.U8 q2[7], d4[7], 7
This is the only way I can imagine doing the left shifts (assuming per-lane [offsets] are even allowed). According to the documentation, VSHLL needs a Q register as the destination, which is why I wrote q2 instead of d4.
VORR(.U8) d4[0], d4[1], d4[0]
VORR(.U8) d4[0], d4[2], d4[0]
...
VORR(.U8) d4[0], d4[7], d4[0]
Last step should give the result.
VST1.8 d4[0], [d[i] addr]
Simple store of the result.
This is my first approach to ARM NEON, so many of these assumptions are probably incorrect. Help me understand the possible errors, and suggest a better solution if possible.
EDIT: This is the final working code, after applying the suggested solution:
__asm__ __volatile ("VLD2.8 {d0, d1}, [%[ordered_center]] \n\t"  // de-interleave t0/t1
"VCGT.U8 d2, d1, d0 \n\t"       // 0xFF in each lane where t0 < t1
"MOV r1, #0x01 \n\t"
"MOV r2, #0x0200 \n\t"
"ORR r2, r2, r1 \n\t"
"MOV r1, #0x10 \n\t"
"MOV r3, #0x2000 \n\t"
"ORR r3, r3, r1 \n\t"
"MOVT r2, #0x0804 \n\t"         // r2 = 0x08040201
"MOVT r3, #0x8040 \n\t"         // r3 = 0x80402010
"VMOV.32 d3[0], r2 \n\t"
"VMOV.32 d3[1], r3 \n\t"        // d3 = 0x8040201008040201
"VAND d0, d2, d3 \n\t"          // keep one bit per lane
"VPADDL.U8 d0, d0 \n\t"         // fold all lanes into the low element
"VPADDL.U16 d0, d0 \n\t"
"VPADDL.U32 d0, d0 \n\t"
"VST1.8 d0[0], [%[desc]] \n\t"  // store the descriptor byte
:
: [ordered_center] "r" (ordered_center), [desc] "r" (&desc[i])
: "d0", "d1", "d2", "d3", "r1", "r2", "r3", "memory");
After the comparison, you have an array of 8 booleans represented by 0xff or 0x00. The reason SIMD comparisons (on any architecture) produce those values is to make them useful for a bit-mask operation (and/or a bit-select in NEON's case), so you can turn the result into an arbitrary value quickly, without a multiply.
So rather than reducing them to 1 or 0 and shifting them about, you'll find it easier to mask them with the constant 0x8040201008040201. Then each lane contains the bit corresponding to its position in the final result. You can pre-load the constant into another register (I'll use d3).
VAND d0, d2, d3
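For example, if the eight comparisons came out as true, false, true, true, false, false, false, true, the lanes (lane 0 first) would hold:
d2: ff 00 ff ff 00 00 00 ff   @ comparison result
d3: 01 02 04 08 10 20 40 80   @ the mask constant
d0: 01 00 04 08 00 00 00 80   @ d2 AND d3
and the byte you ultimately want is 0x01 + 0x04 + 0x08 + 0x80 = 0x8d.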
Then, to combine the results, you can use VPADD (instead of OR), which combines adjacent pairs of lanes: d0[0] = d0[0] + d0[1], d0[1] = d0[2] + d0[3], etc. Since the bit patterns do not overlap there is no carry, and add works just as well as or. Also, because the output is half as large as the input, we have to fill in the second half with junk; I've used a second copy of d0 for that.
You'll need to do the add three times to get all columns combined.
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
and now the result will be in d0[0].
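Continuing the example lanes from above, the contents of d0 after each step are:
d0: 01 00 04 08 00 00 00 80   @ after the VAND
d0: 01 0c 00 80 01 0c 00 80   @ after the first VPADD
d0: 0d 80 0d 80 0d 80 0d 80   @ after the second VPADD
d0: 8d 8d 8d 8d 8d 8d 8d 8d   @ after the third VPADD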
As you can see, d0 has room for seven more results, and some lanes of the VPADD operations have been working with junk data. It would be better if you could fetch more data at once and feed that additional work in as you go, so that none of the arithmetic is wasted.
EDIT
Supposing the loop is unrolled four times, with results in d4, d5, d6, and d7, the constant mentioned earlier should be loaded into, say, d30 and d31, and then some q-register arithmetic can be used:
VAND q0, q2, q15
VAND q1, q3, q15
VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0
With the final result in d0[0..3], or simply the 32-bit value in d0[0].
There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.
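If it helps to see it another way, here is roughly the same unrolled sequence expressed with NEON intrinsics (an untested sketch; pack4 and its parameters are names I've made up, with the four comparison results passed in as arguments):

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: pack four 8-lane comparison results (0xFF/0x00 per lane)
   into four descriptor bytes, returned as one 32-bit word. */
static inline uint32_t pack4(uint8x8_t cmp0, uint8x8_t cmp1,
                             uint8x8_t cmp2, uint8x8_t cmp3)
{
    const uint8x8_t  mask  = vcreate_u8(0x8040201008040201ULL);
    const uint8x16_t maskq = vcombine_u8(mask, mask);               /* plays the role of q15 */

    uint8x16_t b01 = vandq_u8(vcombine_u8(cmp0, cmp1), maskq);      /* VAND q0, q2, q15 */
    uint8x16_t b23 = vandq_u8(vcombine_u8(cmp2, cmp3), maskq);      /* VAND q1, q3, q15 */

    uint8x8_t p0 = vpadd_u8(vget_low_u8(b01), vget_high_u8(b01));   /* VPADD.u8 d0, d0, d1 */
    uint8x8_t p1 = vpadd_u8(vget_low_u8(b23), vget_high_u8(b23));   /* VPADD.u8 d2, d2, d3 */
    uint8x8_t p  = vpadd_u8(p0, p1);                                /* VPADD.u8 d0, d0, d2 */
    p = vpadd_u8(p, p);                                             /* VPADD.u8 d0, d0, d0 */

    return vget_lane_u32(vreinterpret_u32_u8(p), 0);                /* d0[0..3] as a 32-bit value */
}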