c optimization signal-processing simd neon

Using NEON instructions to speed up cascaded biquads - how does it work?

I am trying to understand how the cascaded biquad filtering is optimized for Arm processors in CMSIS using Neon extensions. The code is ifdefed under #if defined(ARM_MATH_NEON) here, and documentation is here.

The NEON intrinsics are used when there are more than 4 biquads cascaded. I am puzzled how any kind of parallel instruction execution could be done if the output from one biquad is fed as input to the next one. Could anyone explain what is done in parallel in that piece of code?

Solution

A biquad cascade can be parallelized by offsetting its stages in time.

If you compute 4 biquads at a time, each biquad in the batch doesn't operate on the result its predecessor produces in the same batch of 4, but on the result that predecessor saved from the previous batch of 4. That removes the dependencies within each batch. Thus it takes 4 steps of latency to propagate data diagonally from the first to the last biquad, but each time step completes 4 biquad evaluations, which is 4x the throughput of computing biquads one at a time.
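To make the time offset concrete, here is a minimal scalar sketch in plain C. It is not the CMSIS code and uses no NEON intrinsics; all names (`biquad_coeffs`, `cascade_serial`, `cascade_skewed`, `max_abs_diff`) are hypothetical, and the transposed direct form II update is just one assumed biquad structure. At step `t`, stage `k` works on sample `t - k`, so the four stage updates inside one step do not depend on each other. That independent inner loop is the part a 4-lane vector implementation could presumably map to one lane per stage.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define NSTAGES 4 /* assumed batch width: one stage per SIMD lane */

typedef struct { float b0, b1, b2, a1, a2; } biquad_coeffs;
typedef struct { float z1, z2; } biquad_state;

/* One transposed direct form II update: consumes x, returns y. */
static float biquad_step(const biquad_coeffs *c, biquad_state *s, float x)
{
    float y = c->b0 * x + s->z1;
    s->z1 = c->b1 * x - c->a1 * y + s->z2;
    s->z2 = c->b2 * x - c->a2 * y;
    return y;
}

/* Reference: each sample runs through all stages in order, so stage k
 * must wait for stage k-1 within the same step (serial dependency). */
void cascade_serial(const biquad_coeffs c[NSTAGES],
                    const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    biquad_state s[NSTAGES] = {{0}};
    for (size_t t = 0; t < n; t++) {
        float v = in[t];
        for (int k = 0; k < NSTAGES; k++)
            v = biquad_step(&c[k], &s[k], v);
        out[t] = v;
    }
}

/* Time-skewed: at step t, stage k processes sample t-k, reading the
 * output that stage k-1 saved on the previous step. */
void cascade_skewed(const biquad_coeffs c[NSTAGES],
                    const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    biquad_state s[NSTAGES] = {{0}};
    float prev[NSTAGES] = {0}; /* stage outputs from the previous step */

    /* NSTAGES-1 extra steps drain the pipeline at the end. */
    for (size_t t = 0; t < n + NSTAGES - 1; t++) {
        float x[NSTAGES], y[NSTAGES];

        x[0] = (t < n) ? in[t] : 0.0f;   /* new sample, or zero padding */
        for (int k = 1; k < NSTAGES; k++)
            x[k] = prev[k - 1];          /* saved from the previous step */

        /* These NSTAGES updates are mutually independent. */
        for (int k = 0; k < NSTAGES; k++)
            y[k] = biquad_step(&c[k], &s[k], x[k]);

        for (int k = 0; k < NSTAGES; k++)
            prev[k] = y[k];

        /* The last stage lags the input by NSTAGES-1 steps. */
        if (t >= NSTAGES - 1)
            out[t - (NSTAGES - 1)] = y[NSTAGES - 1];
    }
}

/* Largest absolute difference between two buffers, for comparison. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}
```

Running both functions on the same input with the same coefficients and comparing the outputs with `max_abs_diff` should show agreement to within rounding. During the first few steps the later stages are fed zeros, which is harmless here because a biquad with zero state and zero input stays at zero.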