Tags: gcc, arm, simd, intrinsics, neon

NEON optimization for multiplication and store on ARM


On an ARM Cortex-A15 board I'm trying to optimize perfectly working C code by using NEON intrinsics.

Compiler: gcc 4.7 on Ubuntu 12.04

Flags: -g -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -ftree-vectorize -DDRA7XX_ARM -DARM_PROC -DSL -funroll-loops -ftree-loop-ivcanon -mfloat-abi=hard

I want to implement the following function; it's just a simple load->multiply->store.

Here are some parameters: *input is a pointer to an array of size 40680, and after the loop completes the pointer should retain its current position so the same can be done for the next input stream via the input pointer.

            float32_t A=0.7;
            float32_t *ptr_op=(float*)output[9216];
            float32x2_t reg1;

             for(i= 0;i< 4608;i+=4){        
                /*output[(2*i)] = A*(*input); // C version
                input++;                 
                output[(2*i)+1] = A*(*input);
                input++;*/

                reg1=vld1q_f32(input++);    //Neon version              
                R_N=vmulq_n_f32(reg1,A);
                vst1q_f32(ptr_op++,R_N);
            } 

I want to understand where I am making a mistake in this loop, because it seems pretty straightforward.
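For clarity, here is a minimal self-contained sketch of the contiguous multiply I am aiming for (the names scale_by_const, src, dst and n are placeholders, and it assumes n is a multiple of four and ignores the strided output indexing of the commented-out C version):

    #include <arm_neon.h>

    /* Sketch only: dst[i] = A * src[i] for n floats. */
    void scale_by_const(float32_t *dst, const float32_t *src, float32_t A, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            float32x4_t v = vld1q_f32(src);   /* load four floats        */
            v = vmulq_n_f32(v, A);            /* multiply each lane by A */
            vst1q_f32(dst, v);                /* store four floats       */
            src += 4;                         /* a q register holds four float32 values */
            dst += 4;
        }
    }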

Here is my inline assembly implementation of the same operation. Am I going in the right direction?

__asm__ __volatile__(
              "\t mov r4, #0\n"
              "\t vdup.32 d1,%3\n"
              "Lloop2:\n"
              "\t cmp r4, %2\n"
              "\t bge Lend2\n"
              "\t vld1.32  d0, [%0]!\n"             
              "\t vmul.f32 d0, d0, d1\n"
              "\t vst1.32 d0, [%1]!\n"
              "\t add r4, r4, #2\n"
              "\t b Lloop2\n"
              "Lend2:\n"
              : "=r"(input), "=r"(ptr_op), "=r"(length), "=r"(A)
              : "0"(input), "1"(ptr_op), "2"(length), "3"(A)
              : "cc", "r4", "d1", "d0");

Solution

  • Hmmmm, does your code compile in the first place? I didn't know that you can multiply a vector by a float scalar. Probably the compiler converted it for you.

    Anyway, you have to understand that most NEON instructions come with long result latencies. Unless you hide those latencies properly, your code won't be any faster than the standard C version, if not slower.

    vld1q..... // 1 cycle
    // 4 cycles latency + potential cache miss penalty
    vmulq..... // 2 cycles
    // 6 cycles latency
    vst1q..... // 1 cycle
    // 2 cycles loop overhead
    

    The example above roughly shows the cycles required for each iteration.

    And as you can see, that is a minimum of 16 cycles per iteration, of which only 4 cycles are spent on actual computation while the remaining 12 are wasted on stalls and loop overhead.

    This is called a RAW (read-after-write) dependency.

    The easiest, and practically the only, way to hide these latencies is loop unrolling, and a deep one at that.
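    As a minimal illustration of the idea (a sketch only, with placeholder names scale_unrolled2, src, dst, A and n), unrolling by just two vectors already gives the pipeline two independent chains whose loads and multiplies can overlap:

    #include <arm_neon.h>

    /* Sketch: two independent vectors (8 floats) per iteration;
       n is assumed to be a multiple of 8. */
    void scale_unrolled2(float *dst, const float *src, float A, int n)
    {
        int i;
        for (i = 0; i < n; i += 8)
        {
            float32x4_t a = vld1q_f32(src);       /* chain 1 */
            float32x4_t b = vld1q_f32(src + 4);   /* chain 2, independent of chain 1 */
            a = vmulq_n_f32(a, A);                /* can issue while b is still loading */
            b = vmulq_n_f32(b, A);
            vst1q_f32(dst,     a);
            vst1q_f32(dst + 4, b);
            src += 8;
            dst += 8;
        }
    }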

    Unrolling by four vectors per iteration is usually sufficient, and eight is even better, if you don't mind the code length.

    #include <arm_neon.h>

    /* Scales `length` floats from pSrc into pDst by coeff.
       Note: pSrc and pDst must not overlap, because the remainder
       handling below reprocesses a few elements. */
    void vecMul(float * pDst, float * pSrc, float coeff, int length)
    {
        const float32x4_t scal = vmovq_n_f32(coeff);
        float32x4x4_t veca, vecb;
    
        length -= 32;
    
        if (length >= 0)
        {
            while (1)
            {
                do
                {
                    length -= 32;
                    veca = vld1q_f32_x4(pSrc);   /* an x4 load moves 16 floats */
                    pSrc += 16;
                    vecb = vld1q_f32_x4(pSrc);
                    pSrc += 16;
    
                    veca.val[0] = vmulq_f32(veca.val[0], scal);
                    veca.val[1] = vmulq_f32(veca.val[1], scal);
                    veca.val[2] = vmulq_f32(veca.val[2], scal);
                    veca.val[3] = vmulq_f32(veca.val[3], scal);
                    vecb.val[0] = vmulq_f32(vecb.val[0], scal);
                    vecb.val[1] = vmulq_f32(vecb.val[1], scal);
                    vecb.val[2] = vmulq_f32(vecb.val[2], scal);
                    vecb.val[3] = vmulq_f32(vecb.val[3], scal);
    
                    vst1q_f32_x4(pDst, veca);    /* an x4 store moves 16 floats */
                    pDst += 16;
                    vst1q_f32_x4(pDst, vecb);
                    pDst += 16;
                } while (length >= 0);
    
                if (length <= -32) return;

                /* 1..31 floats remain: rewind so that one final full
                   32-float block overlaps the block just processed */
                pSrc += length;
                pDst += length;
            }
        }
    
    ///////////////////////////////////////////////////////////////

        /* Fewer than 32 floats in total. `length` is negative here, but
           its low five bits still equal the remaining element count, so
           the bit tests below are valid. Lanes that are not loaded are
           never stored, so their contents do not matter. */

        if (length & 16)
        {
            veca = vld1q_f32_x4(pSrc);
            pSrc += 16;
        }

        if (length & 8)
        {
            vecb.val[0] = vld1q_f32(pSrc);
            pSrc += 4;
            vecb.val[1] = vld1q_f32(pSrc);
            pSrc += 4;
        }

        if (length & 4)
        {
            vecb.val[2] = vld1q_f32(pSrc);
            pSrc += 4;
        }

        if (length & 2)
        {
            vecb.val[3] = vld1q_lane_f32(pSrc++, vecb.val[3], 0);
            vecb.val[3] = vld1q_lane_f32(pSrc++, vecb.val[3], 1);
        }

        if (length & 1)
        {
            vecb.val[3] = vld1q_lane_f32(pSrc, vecb.val[3], 2);
        }
    
        veca.val[0] = vmulq_f32(veca.val[0], scal);
        veca.val[1] = vmulq_f32(veca.val[1], scal);
        veca.val[2] = vmulq_f32(veca.val[2], scal);
        veca.val[3] = vmulq_f32(veca.val[3], scal);
        vecb.val[0] = vmulq_f32(vecb.val[0], scal);
        vecb.val[1] = vmulq_f32(vecb.val[1], scal);
        vecb.val[2] = vmulq_f32(vecb.val[2], scal);
        vecb.val[3] = vmulq_f32(vecb.val[3], scal);
    
        if (length & 16)
        {
            vst1q_f32_x4(pDst, veca);
            pDst += 16;
        }

        if (length & 8)
        {
            vst1q_f32(pDst, vecb.val[0]);
            pDst += 4;
            vst1q_f32(pDst, vecb.val[1]);
            pDst += 4;
        }

        if (length & 4)
        {
            vst1q_f32(pDst, vecb.val[2]);
            pDst += 4;
        }

        if (length & 2)
        {
            vst1q_lane_f32(pDst++, vecb.val[3], 0);
            vst1q_lane_f32(pDst++, vecb.val[3], 1);
        }

        if (length & 1)
        {
            vst1q_lane_f32(pDst, vecb.val[3], 2);
        }
    }
    

    Now we are dealing with eight independent vectors per iteration, hence the latencies are completely hidden, and the potential cache-miss penalty as well as the flat loop overhead are greatly diminished.
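    For completeness, a call for the scenario in the question might look like the sketch below; the buffer names and the element count are assumptions, and length is simply the number of floats to process (it does not have to be a multiple of four, since the tail code handles the remainder):

    /* Assumed forward declaration of the routine above. */
    void vecMul(float *pDst, float *pSrc, float coeff, int length);

    int main(void)
    {
        static float in[9216], out[9216];   /* hypothetical buffers */
        vecMul(out, in, 0.7f, 9216);        /* scale 9216 contiguous floats by 0.7 */
        return 0;
    }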