I used the following code to swap the 4 scalars in a float32x4_t vector: {1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{2,1}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}
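For reference, here is a minimal compilable version of the above (the wrapper name Reverse4 and the small main are only for testing):

#include <arm_neon.h>
#include <stdio.h>

// Reverse all four lanes: {1,2,3,4} -> {4,3,2,1}
static float32x4_t Reverse4(float32x4_t Vec)
{
    float32x4_t Rev = vrev64q_f32(Vec);     //{2,1,4,3}
    float32x2_t High = vget_high_f32(Rev);  //{4,3}
    float32x2_t Low = vget_low_f32(Rev);    //{2,1}
    return vcombine_f32(High, Low);         //{4,3,2,1}
}

int main(void)
{
    float32x4_t Vec = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t Swap = Reverse4(Vec);
    printf("%g %g %g %g\n",
           vgetq_lane_f32(Swap, 0), vgetq_lane_f32(Swap, 1),
           vgetq_lane_f32(Swap, 2), vgetq_lane_f32(Swap, 3));
    return 0;
}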
Can you suggest faster code?
Thank you, Zvika
That is possibly as good as it gets.
The code reverse engineered from the compiler output (for aarch64, gcc/clang -O3) would be
vec = vrev64q_f32(vec);
return vextq_f32(vec,vec,2);
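For illustration, the same two intrinsics wrapped in a self-contained function (the name reverse_f32x4 is arbitrary):

#include <arm_neon.h>

// Reverse all four lanes: {1,2,3,4} -> {4,3,2,1}
static inline float32x4_t reverse_f32x4(float32x4_t vec)
{
    vec = vrev64q_f32(vec);        // swap within each 64-bit half: {2,1,4,3}
    return vextq_f32(vec, vec, 2); // rotate the halves: {4,3,2,1}
}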
On armv7 (gcc 11.2) your original version compiles to
vrev64.32 q0, q0
vswp d0, d1
whereas the other, more compact version compiles to
vrev64.32 q0, q0
vext.32 q0, q0, q0, #2
If you prefer the vswp approach (armv7 only), keep your code as is, since there is no intrinsic for vswp.
On armv7 you could also use
float32x2_t lo = vrev64_f32(vget_high_f32(vec));
float32x2_t hi = vrev64_f32(vget_low_f32(vec));
return vcombine_f32(lo, hi);
When inlined, and when the result can be produced in another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 typically run at 1 cycle per 64 bits with a 4-cycle latency, so this could be twice as fast as the other approaches.
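As a sketch, the same idea as a stand-alone inlinable function (the name is again arbitrary):

#include <arm_neon.h>

// Reverse all four lanes by reversing each 64-bit half and swapping the halves.
static inline float32x4_t reverse_f32x4_v7(float32x4_t vec)
{
    float32x2_t lo = vrev64_f32(vget_high_f32(vec)); // {3,4} -> {4,3}
    float32x2_t hi = vrev64_f32(vget_low_f32(vec));  // {1,2} -> {2,1}
    return vcombine_f32(lo, hi);                     // {4,3,2,1}
}

On armv7 the vget_high_f32/vget_low_f32 and vcombine_f32 calls should cost nothing by themselves (the d registers are simply the halves of the q registers), so after inlining only the two independent vrev64.32 d-register instructions should remain.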