Tags: c, assembly, simd, arm64

Modulo on ARM SIMD AArch64 (NEON)


I am learning about the Armv8 AArch64 SIMD instructions in the hope of optimizing some calculations. In this case, I am looking for a modulo operation on a 4xf32 vector.

How can I implement a modulo with the NEON instruction set?

Note: I actually am looking for something to make sure my angle values stay between -PI and +PI (cyclical, not clamping), so I am also interested in other solutions for that.

Note: I am currently trying to do this with the arm_neon.h header in C, but at some point I might write the assembly directly, to optimize further by combining instructions without storing intermediate results in variables.


Solution

  • The Armv8-A ASIMD instruction set extension does not have a modulo instruction, neither for floating point nor for integer. However, for a divisor of 1, you can emulate modulo by rounding the number towards zero using a “convert to integer” and then subtracting the rounded number from the original, giving you the fractional part with the appropriate sign. You can then implement your modulo operation with these identities:

    fmod(a, 1) = a - round_towards_zero(a)
    fmod(a, b) = fmod(a/b, 1) * b
    

    Note that in your case, b is a constant, so this becomes:

    fmod(a, b) = a - round_towards_zero(a * 1/b) * b
    

    This then becomes three instructions: a multiplication of a by 1/b (FMUL), a “round towards zero” (FRINTZ) and a “multiply and subtract” (FMLS). For even better performance, you should consider keeping your angles pre-scaled so that the divisor is 1 and the reduced values lie in the open interval (−1, +1); the modulo then needs only the rounding and a subtraction.
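As a rough sketch of that three-instruction sequence with arm_neon.h: the helper name `fmod_const_f32` and the `HAVE_A64_NEON` macro are made up for illustration, and the scalar loop is only there to handle leftover elements (and to let the code build on non-AArch64 machines). It assumes b is nonzero and that |a/b| stays far below 2^31. Because it multiplies by 1/b instead of dividing, results very close to a multiple of b can differ slightly from the C library's fmodf.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_A64_NEON 1
#else
#define HAVE_A64_NEON 0
#endif

/* Hypothetical helper (not from the answer above):
 * out[i] = in[i] - round_towards_zero(in[i] * (1/b)) * b
 * Assumes b != 0 and |in[i] / b| far below 2^31. */
static void fmod_const_f32(const float *in, float *out, size_t n, float b)
{
    const float inv_b = 1.0f / b;
    size_t i = 0;

#if HAVE_A64_NEON
    const float32x4_t vb = vdupq_n_f32(b);
    const float32x4_t vinv_b = vdupq_n_f32(inv_b);
    for (; i + 4 <= n; i += 4) {
        float32x4_t a = vld1q_f32(in + i);
        float32x4_t q = vmulq_f32(a, vinv_b);     /* FMUL:   a * (1/b)          */
        q = vrndq_f32(q);                         /* FRINTZ: round towards zero */
        vst1q_f32(out + i, vfmsq_f32(a, q, vb));  /* FMLS:   a - q * b          */
    }
#endif

    /* Scalar tail; also the whole loop when not building for AArch64. */
    for (; i < n; ++i) {
        float q = (float)(int32_t)(in[i] * inv_b); /* cast truncates towards zero */
        out[i] = in[i] - q * b;
    }
}
```

For angles you would call it with something like `fmod_const_f32(angles, reduced, count, 6.2831853f)`. Note that, like fmod, this leaves results in (−b, +b) with the sign of the input, so by itself it does not yet centre the values on zero.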

    Another thing to consider: if it is known that the angles are out of range by at most b, it might be faster to instead compare with ±b and add/subtract b conditionally if needed.
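A minimal sketch of the compare-and-adjust idea, again with made-up names (`wrap_once_f32`, `HAVE_A64_NEON`). It takes the half range and the period as separate parameters, which is my reading of the angle case in the question: compare against ±π and correct by 2π. It only works under the assumption stated above, namely that no input is further out of range than one period.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_A64_NEON 1
#else
#define HAVE_A64_NEON 0
#endif

/* Hypothetical helper: wraps x[i] into [-half, +half] in place by adding or
 * subtracting `period` at most once. Assumes every input already lies within
 * one period of that range. */
static void wrap_once_f32(float *x, size_t n, float half, float period)
{
    size_t i = 0;

#if HAVE_A64_NEON
    const float32x4_t vhi = vdupq_n_f32(half);
    const float32x4_t vlo = vdupq_n_f32(-half);
    const float32x4_t vplus = vdupq_n_f32(period);
    const float32x4_t vminus = vdupq_n_f32(-period);
    const float32x4_t vzero = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t a = vld1q_f32(x + i);
        uint32x4_t too_high = vcgtq_f32(a, vhi);  /* lane mask: a > +half */
        uint32x4_t too_low = vcltq_f32(a, vlo);   /* lane mask: a < -half */
        /* Select the per-lane correction: -period, +period or 0. */
        float32x4_t adj = vbslq_f32(too_high, vminus, vzero);
        adj = vbslq_f32(too_low, vplus, adj);
        vst1q_f32(x + i, vaddq_f32(a, adj));
    }
#endif

    /* Scalar tail; also the whole loop when not building for AArch64. */
    for (; i < n; ++i) {
        if (x[i] > half) {
            x[i] -= period;
        } else if (x[i] < -half) {
            x[i] += period;
        }
    }
}
```

For the question's use case the call would be along the lines of `wrap_once_f32(angles, count, 3.14159265f, 6.2831853f)`. Whether this beats the multiply/round/multiply-subtract sequence is something to measure on the target core; both are a handful of instructions per vector.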