android performance algorithm image-processing neon

Image resizing using ARM NEON

I'm trying to implement a row-by-row version of this image downscaling algorithm: http://intel.ly/1avllXm, applied to 8-bit RGBA images.

To simplify, consider resizing a single row, w_src -> w_dst. Each source pixel then either contributes its full value to a single output accumulator (weight 1.0), or is split between two consecutive output pixels with weights alpha and (1.0f - alpha). In C/pseudo-code:

float *acc = calloc(w_dst, sizeof(float));  // accumulators must start at zero
x_dst = 0
for x = 0 .. w_src:
  if x is a pivot column:
     acc[x_dst] += src[x] * alpha;
     x_dst++;
     acc[x_dst] += src[x] * (1.0f - alpha);
  else
     acc[x_dst] += src[x];

Finally, divide each accumulator channel by the number of source pixels contributing to it (area, a float value):

uint8_t *dst = malloc(w_dst);
for x_dst = 0 .. w_dst
  dst[x_dst] = (uint8_t)round(acc[x_dst] / area);

My reference pure C implementation works correctly. However, I'm wondering whether there's a way to speed things up using NEON operations (remember that each pixel is 8-bit RGBA). Thanks!

Solution

Unfortunately, NEON isn't well suited to this kind of job. If the source and destination resolutions were fixed, it would be possible to NEONize the resize with dynamic vectors, but summing a variable number of adjacent pixels doesn't map cleanly onto SIMD.

I suggest replacing the float arithmetic with fixed-point arithmetic. That alone will help a lot.
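As a rough illustration of what an integer-only version could look like, here is a minimal sketch for a single channel. It is not the code from the question: the function name downscale_row_fixed is made up, and it assumes w_dst <= w_src. Instead of a power-of-two fixed-point scale it uses w_dst as the scale factor, so each source pixel carries weight w_dst and each output pixel collects a total weight of w_src. The pivot weights then become exact integers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the question): downscale one single-channel
   row using integer weights only. Assumes 0 < w_dst <= w_src.
   Returns 0 on success, -1 if the accumulator buffer cannot be allocated. */
static int downscale_row_fixed(const uint8_t *src, int w_src,
                               uint8_t *dst, int w_dst)
{
    uint32_t *acc = calloc((size_t)w_dst, sizeof *acc);
    if (!acc)
        return -1;

    int x_dst = 0;
    uint32_t room = (uint32_t)w_src;   /* weight acc[x_dst] can still take */

    for (int x = 0; x < w_src; x++) {
        uint32_t w = (uint32_t)w_dst;  /* weight of this source pixel */
        if (w > room) {
            /* Pivot column: fill the current output, carry the rest over. */
            acc[x_dst] += src[x] * room;
            w -= room;
            x_dst++;
            room = (uint32_t)w_src;
        }
        acc[x_dst] += src[x] * w;
        room -= w;
        if (room == 0) {               /* output pixel is complete */
            x_dst++;
            room = (uint32_t)w_src;
        }
    }

    /* Every accumulator holds a total weight of w_src; round to nearest. */
    for (int i = 0; i < w_dst; i++)
        dst[i] = (uint8_t)((acc[i] + (uint32_t)w_src / 2) / (uint32_t)w_src);

    free(acc);
    return 0;
}
```

For RGBA the same loop would run over the four interleaved channels of each pixel. The accumulators stay within 32 bits as long as 255 * w_src fits, which holds for any realistic row width.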

Besides, division is very slow, and it hurts performance most when done inside a loop. Replace it with a multiplication by the reciprocal, computed once outside the loop:

uint8_t *dst = malloc(w_dst);
float area_ret = 1.0f/area;  // reciprocal, computed once outside the loop
for x_dst = 0 .. w_dst
  dst[x_dst] = (uint8_t)round(acc[x_dst] * area_ret);
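Written out as compilable C, the reciprocal trick might look like the sketch below. The function name normalize_row is hypothetical, and the code assumes area > 0 and non-negative accumulators, so adding 0.5 and truncating is enough to round to nearest without calling round().

```c
#include <stdint.h>

/* Hypothetical helper (name not from the question): convert float
   accumulators to 8-bit output values.
   Assumes area > 0 and acc[i] >= 0 for all i. */
static void normalize_row(const float *acc, uint8_t *dst, int w_dst, float area)
{
    const float area_ret = 1.0f / area;  /* the only division, done once */

    for (int x_dst = 0; x_dst < w_dst; x_dst++) {
        float v = acc[x_dst] * area_ret + 0.5f;  /* +0.5 for round-to-nearest */
        if (v > 255.0f)
            v = 255.0f;                          /* guard against overshoot */
        dst[x_dst] = (uint8_t)v;                 /* truncation finishes the rounding */
    }
}
```

Multiplying by a precomputed reciprocal can differ from a true division in the last bit of the float result, which only matters for values sitting exactly on a rounding boundary.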