I used the following code to swap the 4 scalars in a float32x4_t vector: {1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{2,1}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}
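For reference, here is a minimal compilable version of the above (the wrapper name Reverse4 and the small main are only for testing):

#include <arm_neon.h>
#include <stdio.h>

// Reverse all four lanes: {1,2,3,4} -> {4,3,2,1}
static float32x4_t Reverse4(float32x4_t Vec)
{
    float32x4_t Rev = vrev64q_f32(Vec);     //{2,1,4,3}
    float32x2_t High = vget_high_f32(Rev);  //{4,3}
    float32x2_t Low = vget_low_f32(Rev);    //{2,1}
    return vcombine_f32(High, Low);         //{4,3,2,1}
}

int main(void)
{
    float32x4_t Vec = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t Swap = Reverse4(Vec);
    printf("%g %g %g %g\n",
           vgetq_lane_f32(Swap, 0), vgetq_lane_f32(Swap, 1),
           vgetq_lane_f32(Swap, 2), vgetq_lane_f32(Swap, 3));
    return 0;
}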
Can you suggest faster code?
Thank you, Zvika
That is possibly as good as it gets.
The code reverse engineered from the compiler output (for aarch64, gcc/clang -O3) would be
vec = vrev64q_f32(vec);
return vextq_f32(vec,vec,2);
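For illustration, the same two intrinsics wrapped in a self-contained function (the name reverse_f32x4 is arbitrary):

#include <arm_neon.h>

// Reverse all four lanes: {1,2,3,4} -> {4,3,2,1}
static inline float32x4_t reverse_f32x4(float32x4_t vec)
{
    vec = vrev64q_f32(vec);        // swap within each 64-bit half: {2,1,4,3}
    return vextq_f32(vec, vec, 2); // rotate the halves: {4,3,2,1}
}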
On armv7 (gcc 11.2) your original version compiles to
vrev64.32 q0, q0
vswp d0, d1
whereas the other, more compact version compiles to
vrev64.32 q0, q0
vext.32 q0, q0, q0, #2
If you prefer the vswp approach (armv7 only), keep your code as is, since there is no intrinsic for vswp.
On armv7 you could also use
float32x2_t lo = vrev64_f32(vget_high_f32(vec));
float32x2_t hi = vrev64_f32(vget_low_f32(vec));
return vcombine_f32(lo, hi);
When inlined, and when the result can be produced in another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 typically run at 1 cycle per 64 bits with a 4-cycle latency, so this could be twice as fast as the other approaches.
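As a sketch, the same idea as a stand-alone inlinable function (the name is again arbitrary):

#include <arm_neon.h>

// Reverse all four lanes by reversing each 64-bit half and swapping the halves.
static inline float32x4_t reverse_f32x4_v7(float32x4_t vec)
{
    float32x2_t lo = vrev64_f32(vget_high_f32(vec)); // {3,4} -> {4,3}
    float32x2_t hi = vrev64_f32(vget_low_f32(vec));  // {1,2} -> {2,1}
    return vcombine_f32(lo, hi);                     // {4,3,2,1}
}

On armv7 the vget_high_f32/vget_low_f32 and vcombine_f32 calls should cost nothing by themselves (the d registers are simply the halves of the q registers), so after inlining only the two independent vrev64.32 d-register instructions should remain.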