I am trying to figure out whether, and how, a specific piece of existing code can be parallelized for the NEON SIMD unit of an ARM Cortex-A9. This is the code:
    for (int i = 0; i < 11; i++)
    {
        f4UF1 *= F[i];
        A[i][2] = A[i][1];
        A[i][1] = A[i][0];
        A[i][0] = f4UF1;
        B[i][2] = B[i][1];
        B[i][1] = B[i][0];
        C[i] = 0;
        C[i] += D[i][0] * A[i][0];
        C[i] += D[i][1] * A[i][1];
        C[i] += D[i][2] * A[i][2];
        C[i] -= E[i][1] * B[i][1];
        C[i] -= E[i][2] * B[i][2];
        B[i][0] = C[i] / E[i][0];
        f4UF1 = B[i][0];
    }
I have looked at the code for quite a while now and I am almost sure that it cannot be parallelized efficiently, but I thought I would ask here anyway. I am not expecting ready-made code, just ideas on how to do it. Thanks :)
So yes, this does look like a biquad for which the coefficients are changed for each sample, perhaps because you are smoothing them.
As a commenter mentioned, you probably want to pre-compute the 1/E[i][0] scaling factor and perhaps roll it into the other coefficients to reduce the number of multiplies, especially on floating point platforms. You can also often normalize the biquad to get rid of the D[i][0] as well (making it 1.0), and just apply a scalar to the whole output.
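To make the reciprocal idea concrete, here is a minimal sketch in C. It assumes the coefficients only need to be rescaled when they change, not on every pass through the loop. The helper names normalize_stage and stage_output are made up for illustration and are not part of the original code; d, e, a and b stand for one row of D, E, A and B.

```c
#include <assert.h>

/* Hypothetical helper: fold 1/e[0] into the other coefficients once,
   so the inner loop no longer needs a divide. */
static void normalize_stage(const float d[3], const float e[3],
                            float dn[3], float en[2])
{
    assert(e[0] != 0.0f);            /* the original code divides by E[i][0] */
    const float inv = 1.0f / e[0];
    dn[0] = d[0] * inv;
    dn[1] = d[1] * inv;
    dn[2] = d[2] * inv;
    en[0] = e[1] * inv;              /* scaled E[i][1] */
    en[1] = e[2] * inv;              /* scaled E[i][2] */
}

/* One stage output using the pre-scaled coefficients.
   a[0..2] and b[1..2] play the roles of A[i][0..2] and B[i][1..2]. */
static float stage_output(const float dn[3], const float en[2],
                          const float a[3], const float b[3])
{
    return dn[0] * a[0] + dn[1] * a[1] + dn[2] * a[2]
         - en[0] * b[1] - en[1] * b[2];
}
```

Note that pre-multiplying by the reciprocal can round slightly differently from dividing the final sum, so the results may not be bit-identical to the original loop.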
And of course, you probably have realized that you want to keep everything in registers during the loop and then only write them out to memory after the loop is done... ;-)
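As a rough illustration of the "keep it in registers" point, here is a sketch of a single biquad section run over a block of samples. This is an assumption about how the surrounding code is organized (an outer loop over samples), and the names biquad_state and biquad_block are hypothetical. The coefficients are assumed to be already normalized so that no divide is needed; x1, x2 correspond to the A history and y1, y2 to the B history.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state for one section: two past inputs, two past outputs. */
typedef struct {
    float x1, x2;
    float y1, y2;
} biquad_state;

/* Filter n samples in place. b[] are feed-forward coefficients, a[] are
   feedback coefficients, both assumed pre-divided by the leading
   feedback coefficient. */
static void biquad_block(float *buf, size_t n,
                         const float b[3], const float a[2],
                         biquad_state *st)
{
    assert(st != NULL);

    /* Load the state into locals once; the compiler can keep these
       in registers for the whole loop. */
    float x1 = st->x1, x2 = st->x2;
    float y1 = st->y1, y2 = st->y2;

    for (size_t k = 0; k < n; k++) {
        const float x0 = buf[k];
        const float y0 = b[0] * x0 + b[1] * x1 + b[2] * x2
                       - a[0] * y1 - a[1] * y2;
        x2 = x1; x1 = x0;            /* shift input history */
        y2 = y1; y1 = y0;            /* shift output history */
        buf[k] = y0;
    }

    /* Write the state back to memory only after the loop is done. */
    st->x1 = x1; st->x2 = x2;
    st->y1 = y1; st->y2 = y2;
}
```

Whether this maps directly onto the loop in the question depends on what the index i really iterates over, so treat it as a pattern rather than a drop-in replacement.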
After that, there are two vectorization techniques that I'm aware of (though I'm interested in Nils' ideas as well):