I tried to write a simple piece of NEON code but found it is slower than the regular C++ implementation. The code is as below:
#include <arm_neon.h>

float A[] = {1,2,3,4};
float B[] = {2,3,4,5};
float32x4_t v1;
float32x4_t v2;
float32x4_t v; // was missing in the original snippet
int counter = 1000000;
while(counter--){
v1 = vld1q_f32(A);
v2 = vld1q_f32(B);
v = vaddq_f32(v1,v2);
vst1q_f32(A,v);
// A[0] = A[0]+B[0]; // regular implementation
// A[1] = A[1]+B[1]; // regular implementation
// A[2] = A[2]+B[2]; // regular implementation
// A[3] = A[3]+B[3]; // regular implementation
}
I searched for the reason, and my guess is that the in-order pipeline makes this simple task stall the CPU. Could anyone explain in more detail? Is there any way to improve the performance of this NEON implementation? Or is it better to use the regular implementation rather than NEON for this kind of simple task? Thank you.
Your test routine is flawed from the start:
Since all the inputs are clearly visible to the compiler at build time, the compiler will simply fold the computation away and generate machine code equivalent to the following:
A[0] = 3.0f;
A[1] = 5.0f;
A[2] = 7.0f;
A[3] = 9.0f;
To prevent the compiler from cheating like this, you have to hide the inputs behind pointers it cannot see through:
void myFunc_c(float *pA, float *pB, uint32_t count)
{
if (count == 0) return;
do {
*pA++ += *pB++;
} while (--count);
}
void myFunc_neon(float *pA, float *pB, uint32_t count)
{
    float32x4_t a, b;
    count >>= 2; // 4 floats per iteration; note this drops any remainder if count isn't a multiple of 4
    if (count == 0) return;
    do {
        a = vld1q_f32(pA);   // load 4 floats from each array
        b = vld1q_f32(pB);
        a = vaddq_f32(a, b); // 4 additions in one instruction
        vst1q_f32(pA, a);    // store the 4 sums back to pA
        pA += 4;
        pB += 4;
    } while (--count);
}
All you need to do is allocate enough memory for pA and pB, initialize them if you want, and call the functions above.
I think the NEON version will be roughly 3 times faster.