I have a matrix multiplication which looks like this:
void gemm_nn(int N, int K, float *A, float *B, float *C) {
    int j, k;
    for (k = 0; k < K; k++)
        for (j = 0; j < N; j++)
            C[j] += A[k] * B[k * N + j];
}
The floats are single precision: 4 bytes, 32 bits.
I would like to optimize the loop for ARMv8-A (64-bit).
Can I load 4 consecutive floats into a single 128-bit register and do a single multiply-accumulate operation on them?
Could you point me to the instructions I should try in order to achieve this?
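Yes. The native Advanced SIMD (NEON) instructions are what you need: ld1 {v16.4s}, [x0] loads four consecutive floats into one 128-bit register (x0 here stands for whichever register holds the address), and fmla performs a fused multiply-accumulate on all four lanes at once.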
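In C you would normally not write these instructions by hand but use the intrinsics from arm_neon.h, which compilers typically lower to ld1, fmla and st1. Below is a minimal sketch of the loop from the question written that way. The name gemm_nn_simd and the added const qualifiers are my own choices, not part of the original code, and the scalar fallback is there only so the same file still builds on a non-AArch64 host.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical SIMD variant of gemm_nn from the question:
 * C[j] += A[k] * B[k * N + j] for all k in [0, K) and j in [0, N). */
void gemm_nn_simd(int N, int K, const float *A, const float *B, float *C) {
    assert(N >= 0 && K >= 0);  /* sanity check on the dimensions */
    for (int k = 0; k < K; k++) {
        const float a = A[k];                    /* scalar reused for the whole row */
        const float *b_row = B + (size_t)k * N;  /* start of row k of B */
        int j = 0;
#if defined(__aarch64__)
        /* Process 4 floats per iteration in one 128-bit register. */
        for (; j + 4 <= N; j += 4) {
            float32x4_t b = vld1q_f32(b_row + j);  /* load 4 floats of B */
            float32x4_t c = vld1q_f32(C + j);      /* load 4 floats of C */
            c = vfmaq_n_f32(c, b, a);              /* c = c + b * a in each lane */
            vst1q_f32(C + j, c);                   /* store 4 floats back to C */
        }
#endif
        /* Scalar tail for the remaining N % 4 elements (or the whole row
         * when not compiling for AArch64). */
        for (; j < N; j++)
            C[j] += a * b_row[j];
    }
}
```

The tail loop matters because N is not guaranteed to be a multiple of 4. If this is still too slow, a likely next step is to keep several blocks of C in registers across the k loop instead of reloading and storing them every iteration, but measure before assuming that helps.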