Vector-matrix multiplication with ARM NEON

I need to multiply a large row vector (10,000 elements) by a large column-major matrix (10,000 rows, 400 columns). I decided to use ARM NEON because I'm curious about the technology and would like to learn more about it.

Here's a working example of vector matrix multiplication I wrote:

//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - number of matrix columns
//int vecRows - vector length, same as the number of matrix rows

for (int i = 0, max_i = matCols; i < max_i; i++) {
    for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
        float32x4_t mat_val = vld1q_f32(mat_ptr);    //get 4 elements from matrix
        float32x4_t vec_val = vld1q_f32(vec_ptr);    //get 4 elements from vector

        float32x4_t out_val = vmulq_f32(mat_val, vec_val);  //multiply vectors
        float32_t total_sum = vaddvq_f32(out_val);          //sum elements of vector together
        out_ptr[i] += total_sum;
    }

    vec_ptr = &myVec[0];   //switch ptr back again to zero element
}

The problem is that it takes a very long time to compute: 30 ms on an iPhone 7+, whereas my goal is 1 ms or less, if that is possible. The current execution time is understandable, since the inner loop body runs 400 * (10000 / 4) = 1,000,000 times.

I also tried processing 8 elements per iteration instead of 4. It seems to help, but the numbers are still very far from my goal.

I understand that I may be making some horrible mistakes, since I'm new to ARM NEON. I would be grateful for any tips on how to optimize my code.

Also, is it worth doing large vector-matrix multiplication with ARM NEON? Is this technology a good fit for that purpose?

Solution

Your code is flawed: even when both matCols and vecRows are 4, it runs the loop body four times and performs a horizontal add in each one, where a single pass could produce all four results. What's the point of SIMD then?

The major performance problem lies in this line: float32_t total_sum = vaddvq_f32(out_val);
You should never convert a vector to a scalar inside a loop, since it causes a pipeline hazard that costs around 15 cycles every time.
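To illustrate that point on the loop from the question, here is a minimal sketch, not a tuned implementation. It keeps four partial sums in a vector register and does the horizontal add once per column instead of once per iteration. The function name vec_mat_mul is made up for illustration, the code assumes vecRows is a multiple of 4, and the scalar branch is there only so the snippet also builds on non-ARM machines.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the question): out[c] = dot(vec, column c).
   mat is column-major with `rows` floats per column; rows must be a
   multiple of 4. Unlike the original loop, out[c] is overwritten. */
void vec_mat_mul(const float *vec, const float *mat, float *out,
                 size_t rows, size_t cols)
{
    assert(rows % 4 == 0);
    for (size_t c = 0; c < cols; c++) {
        const float *col = mat + c * rows;
#if defined(__aarch64__)
        float32x4_t acc = vdupq_n_f32(0.0f);   /* four partial sums */
        for (size_t r = 0; r < rows; r += 4) {
            /* acc += col[r..r+3] * vec[r..r+3], staying in vector registers */
            acc = vmlaq_f32(acc, vld1q_f32(col + r), vld1q_f32(vec + r));
        }
        out[c] = vaddvq_f32(acc);              /* one horizontal add per column */
#else
        /* Portable fallback that mirrors the four-lane accumulation. */
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t r = 0; r < rows; r += 4) {
            for (size_t k = 0; k < 4; k++) {
                acc[k] += col[r + k] * vec[r + k];
            }
        }
        out[c] = (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif
    }
}
```

With the question's names, the call would be vec_mat_mul(vec_ptr, mat_ptr, out_ptr, vecRows, matCols). Keep in mind that the matrix alone is 10,000 * 400 * 4 bytes = 16 MB, so memory bandwidth is likely to limit how fast any version of this loop can run.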

The solution:

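    // const float *pVec - 4-element vector
    // const float *pMat - 4x4 matrix, 16 contiguous floats
    // float *pDst       - 4-element result
    float32x4x4_t myMat;
    float32x2_t myVecLow, myVecHigh;

    myVecLow = vld1_f32(&pVec[0]);   // vector elements 0 and 1
    myVecHigh = vld1_f32(&pVec[2]);  // vector elements 2 and 3
    myMat = vld4q_f32(pMat);         // de-interleaving load: val[j] = pMat[j], pMat[j+4], pMat[j+8], pMat[j+12]

    myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);                 // val[0] * vec[0]
    myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);   // += val[1] * vec[1]
    myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);  // += val[2] * vec[2]
    myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);  // += val[3] * vec[3]

    vst1q_f32(pDst, myMat.val[0]);   // store the four results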

Compute all four rows in a single pass.
Transpose the matrix on the fly with vld4.
Use vector-by-scalar multiply-accumulate instead of a vector-by-vector multiply followed by a horizontal add, which is what causes the pipeline hazard.
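If the data flow through vld4q_f32 and the lane multiplies is hard to picture, the following plain C model may help. It is only a sketch of what the 4x4 snippet computes, written with scalar loops; the function name vec4_mul_mat4_model is an assumption made for illustration and is not part of any NEON API.

```c
#include <assert.h>

/* Scalar model of the 4x4 NEON snippet (hypothetical name, illustration only).
   pVec: 4 floats, pMat: 16 contiguous floats, pDst: 4 floats. */
void vec4_mul_mat4_model(const float *pVec, const float *pMat, float *pDst)
{
    float val[4][4];

    assert(pVec && pMat && pDst);

    /* vld4q_f32: lane k of val[j] receives pMat[4*k + j]. */
    for (int k = 0; k < 4; k++) {
        for (int j = 0; j < 4; j++) {
            val[j][k] = pMat[4 * k + j];
        }
    }

    /* vmulq_lane_f32 followed by three vmlaq_lane_f32: every lane of
       val[j] is scaled by the single vector element pVec[j] and accumulated. */
    for (int k = 0; k < 4; k++) {
        float sum = val[0][k] * pVec[0];
        for (int j = 1; j < 4; j++) {
            sum += val[j][k] * pVec[j];
        }
        pDst[k] = sum;
    }
}
```

Lane k of the result is therefore the dot product of the vector with the four consecutive floats starting at pMat[4*k], which is column k when the 4x4 matrix is stored column-major.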

You asked whether SIMD is suitable for matrix operations. A simple "yes" would be a monumental understatement. You don't even need a loop for the 4x4 case shown here.