Optimize gemm (matrix multiplication) with NEON on AArch64

I have a matrix multiplication which looks like this:

void gemm_nn(int N, int K, float *A, float *B, float *C) {
    int j, k;
    for (k = 0; k < K; k++)
        for (j = 0; j < N; j++)
            C[j] += A[k] * B[k * N + j];
}

The floats are single precision (4 bytes, 32 bits).

I would like to optimize the loop for ARMv8-A 64-bit (AArch64).

Could I load 4 consecutive floats into a single 128-bit register and do a single multiply-accumulate operation on them?

Could you point me to the instructions I should use to achieve this?

Solution

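Yes. The native SIMD instructions ld1 (for example ld1 {v16.4s}, [x0], which loads four consecutive floats into one 128-bit register) and fmla (floating-point fused multiply-accumulate, applied to all four lanes at once) are what is needed.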
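
As a rough sketch of how those two instructions map onto the loop from the question, here is a version written with the intrinsics from arm_neon.h rather than hand-written assembly. The function name gemm_nn_neon, the const qualifiers, and the scalar fallback for non-AArch64 builds are assumptions added for illustration; the compiler is expected, but not guaranteed, to emit ld1 for vld1q_f32 and fmla for vfmaq_f32.

```c
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Same contract as gemm_nn in the question:
 * C[j] += A[k] * B[k * N + j] for all k < K, j < N.
 * The name gemm_nn_neon is hypothetical. */
void gemm_nn_neon(int N, int K, const float *A, const float *B, float *C) {
    for (int k = 0; k < K; k++) {
        const float *b_row = B + (size_t)k * N;
        int j = 0;
#if defined(__aarch64__)
        /* Broadcast A[k] into all four lanes of a 128-bit register. */
        float32x4_t a = vdupq_n_f32(A[k]);
        for (; j + 4 <= N; j += 4) {
            float32x4_t c = vld1q_f32(C + j);      /* load 4 floats of C */
            float32x4_t b = vld1q_f32(b_row + j);  /* load 4 floats of B */
            c = vfmaq_f32(c, a, b);                /* c += a * b, per lane */
            vst1q_f32(C + j, c);                   /* store 4 floats back */
        }
#endif
        /* Scalar tail (N not a multiple of 4), and the whole row when
         * not compiling for AArch64. */
        for (; j < N; j++)
            C[j] += A[k] * b_row[j];
    }
}
```

Because C is loaded and stored once per k in this sketch, it may well be limited by memory traffic; keeping a few C vectors in registers across several values of k is a common next step, but that is beyond what the question asks.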