Possible to parallelize filter for ARM NEON?

I am trying to figure out whether, and how, a specific piece of existing code can be parallelized for the NEON SIMD unit of an ARM Cortex-A9. This is the code:

for(int i=0; i < 11; i++)
{
    f4UF1 *= F[i];

    A[i][2] = A[i][1];
    A[i][1] = A[i][0];
    A[i][0] = f4UF1;

    B[i][2] = B[i][1];
    B[i][1] = B[i][0];

    C[i] = 0;

    C[i] += D[i][0] * A[i][0];
    C[i] += D[i][1] * A[i][1];
    C[i] += D[i][2] * A[i][2];

    C[i] -= E[i][1] * B[i][1];
    C[i] -= E[i][2] * B[i][2];

    B[i][0] = C[i] / E[i][0];

    f4UF1 = B[i][0];
}

I have looked at the code for quite a while now and I am almost sure that it cannot be parallelized efficiently, but I thought I would ask here anyway. I am not expecting ready-made code, just ideas on how to do it. Thanks :)

Solution

So yes, this does look like a biquad for which the coefficients are changed for each sample, perhaps because you are smoothing them.

As a commenter mentioned, you probably want to pre-compute the 1/E[i][0] scaling factor and perhaps roll it into the other coefficients to reduce the number of multiplies, especially on floating point platforms. You can also often normalize the biquad to get rid of the D[i][0] as well (making it 1.0), and just apply a scalar to the whole output.
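As a rough sketch of that pre-computation: the helper below (normalize_sections is a made-up name, not something from the question) folds 1/E[i][0] into the remaining coefficients once, outside the per-sample loop. It assumes E[i][0] is non-zero and that the coefficients are not rewritten between the normalization and the filtering.

```c
#include <stddef.h>

/* Hypothetical helper: divide every coefficient of each section by E[i][0]
 * once, so the per-sample loop no longer needs "C[i] / E[i][0]".
 * Assumes E[i][0] != 0 for every section. */
void normalize_sections(float D[][3], float E[][3], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float inv = 1.0f / E[i][0];

        D[i][0] *= inv;
        D[i][1] *= inv;
        D[i][2] *= inv;

        E[i][1] *= inv;
        E[i][2] *= inv;
        E[i][0] = 1.0f;   /* the division in the loop is now a no-op */
    }
}
```

After this, B[i][0] can be assigned the accumulated sum directly. Note that C[i] then holds the already scaled value, which matters if anything else reads C[i].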

And of course, you probably have realized that you want to keep everything in registers during the loop and then only write them out to memory after the loop is done... ;-)
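To illustrate the register point, here is a minimal sketch of one section run over a block of samples. The function name run_section and its parameters are invented for this example. It assumes the coefficients stay constant for the whole block and that e[0] has already been normalized to 1, neither of which is guaranteed by the question.

```c
#include <stddef.h>

/* Hypothetical: run one section over n samples in place.
 * d[] and e[] play the role of D[i][] and E[i][] (with e[0] == 1 assumed),
 * a[] and b[] are the section's state arrays A[i][] and B[i][]. */
void run_section(float *x, size_t n, float gain,
                 const float d[3], const float e[3],
                 float a[3], float b[3])
{
    /* Load the state once; locals can live in registers for the whole loop. */
    float a0 = a[0], a1 = a[1], a2 = a[2];
    float b0 = b[0], b1 = b[1], b2 = b[2];

    for (size_t k = 0; k < n; k++) {
        float in = x[k] * gain;          /* f4UF1 *= F[i] */

        a2 = a1; a1 = a0; a0 = in;       /* shift input history */
        b2 = b1; b1 = b0;                /* shift output history */

        b0 = d[0] * a0 + d[1] * a1 + d[2] * a2
           - e[1] * b1 - e[2] * b2;

        x[k] = b0;                       /* output feeds the next section */
    }

    /* Write the state back to memory only once, after the loop. */
    a[0] = a0; a[1] = a1; a[2] = a2;
    b[0] = b0; b[1] = b1; b[2] = b2;
}
```

Calling this once per section over the same buffer gives the same cascade as the original loop, but only as long as the coefficients really are fixed for the block.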

After that, there are two vectorization techniques that I'm aware of (though I'm interested in Nils' ideas as well):

Channel vectorization - the easiest. If you need to apply filters to multiple data sets at once (very common for stereo audio, for example), you can run two sets of coefficients over two sets of audio data at the same time. I've found that NEON provides just about the right number of registers for two channels if you are using single-precision floating point throughout. Instant 2x speedup, really.
Loop unrolling. This gets a little tricky to describe in detail here, but fortunately there is a nice page here: http://reanimator-web.appspot.com/articles/simdiir. This technique adds pole/zero pairs so that more samples are essentially computed at once. However, the extra poles add extra conditions on the stability of the filter, so you have to be careful. In your case, where the coefficients seem to be dynamic, ensuring stability is probably some kind of nightmare.
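For the first technique (channel vectorization), the data layout could look roughly like the sketch below. It is plain C so it stays portable. The struct section2 and the function section2_step are invented names, and the coefficients are assumed to be normalized so that e0 == 1. Every array holds one value per channel, so each line of the inner loop body corresponds to a single two-lane vector operation if the arrays are kept in NEON float32x2_t registers instead.

```c
enum { LANES = 2 };   /* e.g. left and right channel */

/* Hypothetical layout: one filter section for LANES channels,
 * stored lane-by-lane so each field maps onto one vector register. */
typedef struct {
    float d0[LANES], d1[LANES], d2[LANES];  /* feed-forward coefficients */
    float e1[LANES], e2[LANES];             /* feedback coefficients (e0 == 1) */
    float a1[LANES], a2[LANES];             /* previous two inputs */
    float b1[LANES], b2[LANES];             /* previous two outputs */
} section2;

/* Process one sample per channel, in place. */
void section2_step(section2 *s, float x[LANES])
{
    for (int l = 0; l < LANES; l++) {
        float y = s->d0[l] * x[l]
                + s->d1[l] * s->a1[l]
                + s->d2[l] * s->a2[l]
                - s->e1[l] * s->b1[l]
                - s->e2[l] * s->b2[l];

        s->a2[l] = s->a1[l];  s->a1[l] = x[l];   /* shift input history */
        s->b2[l] = s->b1[l];  s->b1[l] = y;      /* shift output history */

        x[l] = y;   /* output becomes the input of the next section */
    }
}
```

The lanes never interact, which is why this form vectorizes trivially. The dependency from one section to the next, and from one sample to the next, is still there, just carried in every lane at once.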