
NEON: Swap 4 scalars in float32x4_t


I use the following code to reverse the four scalars in a float32x4_t vector, i.e. {1,2,3,4} -> {4,3,2,1}:

float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{2,1}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}

Can you suggest faster code?

Thank you, Zvika


Solution

  • That is possibly as good as it gets.

    Working back from what gcc/clang -O3 generate for aarch64, the equivalent intrinsics would be

    vec = vrev64q_f32(vec);
    return vextq_f32(vec,vec,2);
    

    On armv7 (gcc 11.2) your original version compiles to

        vrev64.32       q0, q0
        vswp    d0, d1
    

    whereas the more compact version compiles to

        vrev64.32       q0, q0
        vext.32 q0, q0, q0, #2
    

    If you prefer the vswp approach (armv7 only), keep your code as is, since there is no intrinsic for the swap itself.

    On armv7 you could also use

    float32x2_t lo = vrev64_f32(vget_high_f32(vec));
    float32x2_t hi = vrev64_f32(vget_low_f32(vec));
    return vcombine_f32(lo, hi);
    

    When this is inlined and the result can be produced in a different register, it can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 typically have a throughput of 1 cycle per 64 bits and a 4-cycle latency, so this could be twice as fast as the other approaches.
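
For reference, here is a minimal sketch of the vrev64q/vextq variant wrapped in a helper that reverses every group of four floats in an array. The names reverse_f32x4 and reverse_quads are made up for this illustration, and the scalar branch is only an assumption added so the lane order can be checked on a machine without NEON; it is not part of the answer above.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Reverse the four lanes of a vector: {a,b,c,d} -> {d,c,b,a}. */
static inline float32x4_t reverse_f32x4(float32x4_t v)
{
    v = vrev64q_f32(v);         /* swap lanes within each 64-bit half: {b,a,d,c} */
    return vextq_f32(v, v, 2);  /* rotate by two lanes: {d,c,b,a} */
}

/* Reverse each complete group of four floats in place.
   Trailing elements (n % 4) are left untouched. */
void reverse_quads(float *data, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4)
        vst1q_f32(data + i, reverse_f32x4(vld1q_f32(data + i)));
}

#else

/* Scalar stand-in with the same lane movement, for hosts without NEON. */
void reverse_quads(float *data, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        float a = data[i];
        float b = data[i + 1];
        data[i]     = data[i + 3];
        data[i + 1] = data[i + 2];
        data[i + 2] = b;
        data[i + 3] = a;
    }
}

#endif
```

Calling reverse_quads on {1,2,3,4,5,6,7,8} with n = 8 should leave {4,3,2,1,8,7,6,5} in the array. Whether this or the half-wise variant is faster depends on the core, so it is worth measuring on the actual target.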