I'm working on code optimization for ARM processors using NEON. However, I have a problem: my algorithm contains the following floating-point computation:
round(x*b - y*a)
where the results can be both positive and negative.
Currently I'm using two VMULs and one VSUB to do the computation in parallel (4 values per operation, using Q registers and 32-bit floats).
Is there a way to handle the rounding? If the results all had the same sign, I know I could simply add or subtract 0.5.
First, NEON suffers from long latency, especially after floating-point multiplications. Because of this, two vmuls followed by one vsub won't gain you very much compared to VFP code.
Therefore, your code should look like this:
vmul.f32 result, x, b
vmls.f32 result, y, a
The multiply-accumulate/subtract instructions are issued back-to-back with the preceding multiply instruction without any latency (9 cycles saved in this case).
Unfortunately, however, I don't understand your actual question. Why would someone want to round float values? Apparently you intend to extract the rounded integer part. There are several ways to do this, but I cannot tell you anything more because your question is, as always, too vague.
I've been following your questions in this forum for quite some time, and I simply cannot get rid of the feeling that you're lacking something very fundamental.
I suggest you read the assembly reference guide PDF from ARM first.
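
For reference, here is a minimal scalar sketch in C of one way the mixed-sign case could be handled, assuming that "round" means round half away from zero and that the results fit in a 32-bit integer. The idea is to copy the sign bit of each result onto the constant 0.5, add it, and then truncate toward zero. The helper names round_half_away and round_diff4 are made up for illustration. On NEON the same steps would presumably map to a bitwise AND with a sign mask, an OR with 0.5, an add, and a truncating float-to-int conversion, but that mapping is an assumption and is not shown here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round half away from zero by adding 0.5 that
   carries the sign of v, then truncating toward zero.
   Assumes v + 0.5 stays within the int32_t range. */
static int32_t round_half_away(float v)
{
    const float half = 0.5f;
    uint32_t v_bits, half_bits;
    float signed_half;

    /* Reinterpret the floats as raw bits without aliasing problems. */
    memcpy(&v_bits, &v, sizeof v_bits);
    memcpy(&half_bits, &half, sizeof half_bits);

    /* Copy the sign bit of v onto 0.5, giving +0.5 or -0.5. */
    half_bits |= v_bits & 0x80000000u;
    memcpy(&signed_half, &half_bits, sizeof signed_half);

    /* The float-to-int cast in C truncates toward zero. */
    return (int32_t)(v + signed_half);
}

/* Hypothetical four-lane model of round(x*b - y*a): one multiply,
   one multiply-subtract, then the sign-aware rounding per lane. */
static void round_diff4(const float x[4], const float b[4],
                        const float y[4], const float a[4],
                        int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        float r = x[i] * b[i];   /* models the multiply            */
        r -= y[i] * a[i];        /* models the multiply-subtract   */
        out[i] = round_half_away(r);
    }
}
```

Because the sign is taken from each lane individually, no branch is needed and positive and negative results can be mixed freely. Values whose magnitude lies just below 0.5 can still round up to 1 in magnitude, since the addition itself rounds, so this is only a sketch and not a bit-exact replacement for a library round().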