arm, neon, intrinsics, cortex-a8

Using ARM NEON intrinsics to add alpha and permute


I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

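#include <arm_neon.h>

void neonPermuteRGBtoBGRA(const unsigned char* src, unsigned char* dst, int numPix)
{
    int numBlocks = numPix / 8; //process 8 pixels at a time
    int numTail   = numPix % 8; //leftover pixels, handled after the NEON loop

    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i=0; i<numBlocks; i++)
    {
        uint8x8x3_t rgb  = vld3_u8 (src); //de-interleave into R, G and B planes
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow

        bgra.val[3] = alpha;

        vst4_u8(dst, bgra); //re-interleave as B, G, R, A

        src += 8*3;
        dst += 8*4;
    }

    for (int i=0; i<numTail; i++) //scalar path for the remaining pixels
    {
        dst[0] = src[2];
        dst[1] = src[1];
        dst[2] = src[0];
        dst[3] = 0xff;

        src += 3;
        dst += 4;
    }
}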

Solution

  • The ARMCC disassembly isn't that fast either:

    • It isn't using the most appropriate instructions

    • It mixes VFP instructions with NEON ones, which causes a big stall every time

    Try this:

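      @ r0 = src, r1 = dst, r2 = numPix (assumed to be a multiple of 8, at least 8)
      mov r2, r2, lsr #3          @ number of 8-pixel blocks
      vmov.i8 d3, #0xff           @ alpha channel
    loop:
      vld3.8 {d0-d2}, [r0]!       @ d0 = R, d1 = G, d2 = B
      subs r2, r2, #1
      vswp d0, d2                 @ swap R and B
      vst4.8 {d0-d3}, [r1]!       @ store as B, G, R, A
      bgt loop

      bx lr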
    

    My suggested code isn't fully optimized either, but further "real" optimizations would seriously harm readability, so I'll stop here.
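
For anyone who wants to stay in C, here is a minimal sketch of the same loop written with intrinsics. It is not the answer's assembly and makes no claim about what a given compiler will emit for it. The function name `rgb_to_bgra`, the `__ARM_NEON` guard, and the scalar fallback are assumptions added for illustration, so that the block also builds and runs on machines without NEON. The scalar loop doubles as the tail handler for pixel counts that aren't a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name is an assumption): converts numPix packed RGB
 * pixels at src into packed BGRA pixels at dst, with alpha fixed at 0xff. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t numPix)
{
    assert(src != NULL && dst != NULL);

    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xff);

    /* NEON path: 8 pixels per iteration. */
    for (; i + 8 <= numPix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i); /* de-interleave R, G, B */
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; /* B */
        bgra.val[1] = rgb.val[1]; /* G */
        bgra.val[2] = rgb.val[0]; /* R */
        bgra.val[3] = alpha;      /* A */

        vst4_u8(dst + 4 * i, bgra); /* re-interleave as B, G, R, A */
    }
#endif

    /* Scalar path: remaining pixels, or every pixel when NEON is unavailable. */
    for (; i < numPix; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Calling `rgb_to_bgra(src, dst, 11)` on a NEON target would run one 8-pixel block through the vector path and the last 3 pixels through the scalar loop. It's worth checking the disassembly before trusting it: whether the three `val[]` assignments collapse into a single swap depends on the compiler.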