How to optimize the computation of a for loop using SIMD?

I am trying to accelerate a stereo matching algorithm on ODROID XU4 ARM platform using Neon SIMD. For this puropose I am using openMp's pragmas.

 void StereoMatch:: sadCol(uint8_t* leftRank,uint8_t* rightRank,const int SAD_WIDTH,const int SAD_WIDTH_STEP, const int imgWidth,int j, int d , uint16_t* cost) 
  {

   uint16_t sum = 0;
   int n = 0;
   int m =0;
      for ( n = 0; n < SAD_WIDTH+1; n++)
      {

     #pragma omp simd
     for(  m = 0; m< SAD_WIDTH_STEP; m = m + imgWidth ) 
         {


        sum += abs(leftRank[j+m+n]-rightRank[j+m+n-d]);

         };
         cost[n] = sum;
         sum = 0;



  };

I am fairly new to SIMD and openMp, I understood that using the SIMD pragma in the code will direct the compiler to vectorize the subtraction, but when I executed the code I noticed no difference. What should I add to my code in order to vectorize it ?

Solution

As said in the comments, ARM-Neon has an instruction which directly does what you want, i.e., compute the absolute difference of unsigned bytes and accumulates it to unsigned short-integers.

Assuming SAD_WIDTH+1==8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):

void sadCol(uint8_t* leftRank,
            uint8_t* rightRank,
            int j,
            int d ,
            uint16_t* cost) {
    const int SAD_WIDTH = 7;
    const int imgWidth = 320;
    const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;

    uint16x8_t cost_8 = {0};
    for(int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth )  {
        cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
    };
    vst1q_u16(cost, cost_8);
};

vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute difference and accumulates it to the first register. Finally, vst1q_u16 stores the register to memory.

You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that.

I have no ARM platform at hand to test it, but "it compiles": https://godbolt.org/z/vPqiYI (and the assembly looks fine, in my eyes). If you optimize with -O3 gcc will unroll the loop.