I want to speed up the matrix multiplication below. Here is the code that I need to improve:
void multiply(int size, int **matA, int **matB, int **matC) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            int t = matC[i][j];
            for (int k = 0; k < size; k++) {
                t += matA[i][k] * matB[k][j];
            }
            matC[i][j] = t; // t already includes the old matC[i][j]
        }
    }
}
I have two input matrices and a result matrix, each 5000x5000. At that size the matrices cannot fit in cache entirely, so does the loop above cause many cache misses (or even page faults)? I want to know how to speed up the multiplication, and how to organize the data: should I use a 1D array or a 2D array?
My attempt is listed below. I chose a 1D array to simulate the 2D array (one new[] per matrix), but I am not sure whether that is faster than a true 2D array. I use a temporary matrix to hold the transpose of matB so that the inner loop reads both operands sequentially, which should avoid cache misses and page faults, and I added AVX2 to get more performance. Which is better at 5000x5000, a 1D array or a 2D array? Any other ideas or tips?
// Allocate a rows x cols matrix as one contiguous buffer plus row pointers.
int **allocate(int rows, int cols) {
    int **mat = new int *[rows];
    int *temp = new int[rows * cols];
    for (int i = 0; i < rows; i++) {
        mat[i] = temp + i * cols;
    }
    return mat;
}
void multiply(int size, int **matA, int **matB, int **matC) {
    int n = size * size; // total number of elements
    // Store matB transposed (column-major order) so the inner product
    // reads both operands sequentially.
    int **transMatB = allocate(size, size);
    // Copy the transposed data; maybe many cache misses here.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        transMatB[i / size][i % size] = matB[i % size][i / size];
    }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int *row = matA[i / size];
        int *col = transMatB[i % size];
        int temp;
#ifdef __AVX2__
        temp = multiplyAndSumArrays(row, col, size);
#else
        temp = 0;
        for (int k = 0; k < size; k++) {
            temp += row[k] * col[k];
        }
#endif
        matC[i / size][i % size] += temp;
    }
    // Free the temporary transposed matrix.
    delete[] transMatB[0];
    delete[] transMatB;
    transMatB = nullptr;
}
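For reference, multiplyAndSumArrays looks roughly like this (a minimal sketch; my real version may differ in the horizontal-sum details):

#include <immintrin.h>

// Dot product of two int arrays using AVX2: eight 32-bit products are
// accumulated per iteration, with a scalar tail for leftover elements.
static int multiplyAndSumArrays(const int *a, const int *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    int k = 0;
    for (; k + 8 <= n; k += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + k));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + k));
        acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(va, vb));
    }
    // Horizontal sum of the eight 32-bit lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    int result = _mm_cvtsi128_si32(s);
    for (; k < n; k++) result += a[k] * b[k]; // scalar tail
    return result;
}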
Matrix-matrix multiplication is probably the most studied kernel when it comes to optimization. For the ultimate treatment, read the paper by Goto and van de Geijn, cited below.
The crux is that

- matrix multiplication performs O(n^3) operations on only O(n^2) data, so there is great potential for data reuse;
- at 5000x5000 none of the matrices fits in cache, so a naive loop ordering keeps streaming data from main memory;
- therefore each loop needs to be blocked (tiled) so that the working set fits in cache.

Specifically to that last point: in a nutshell, each of the 3 loops needs to be split into two loops, one over blocks and one inside a block. You then have 6 loops (on the order of 6! possible loop orderings) and the 3 block sizes as tuning parameters. The above paper analyzes this completely.
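To make that concrete, here is a minimal sketch of the blocked version of the triple loop, with a single block size BS for all three dimensions (the Goto algorithm tunes all three independently):

#include <algorithm>

// Blocked (tiled) multiplication, C += A*B: each loop is split into a
// loop over blocks and a loop within a block. BS is a tuning parameter;
// pick it so that three BS x BS tiles fit in cache together.
const int BS = 64;

void multiplyBlocked(int size, int **matA, int **matB, int **matC) {
    for (int ii = 0; ii < size; ii += BS)
        for (int kk = 0; kk < size; kk += BS)
            for (int jj = 0; jj < size; jj += BS)
                // Accumulate the (ii,jj) tile of C from the (ii,kk)
                // tile of A and the (kk,jj) tile of B.
                for (int i = ii; i < std::min(ii + BS, size); i++)
                    for (int k = kk; k < std::min(kk + BS, size); k++) {
                        int aik = matA[i][k];
                        for (int j = jj; j < std::min(jj + BS, size); j++)
                            matC[i][j] += aik * matB[k][j];
                    }
}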
Note that this is not simple! For a reasonably doable solution, do a recursive 2x2 multiplication: divide each matrix into a 2x2 block structure, and multiply those blocks recursively. You stop the recursion when the blocks are small enough to fit in cache.
That should be doable as a class assignment and gives a nice performance improvement (though not as high as the Goto algorithm!). You can even multi-thread it simply.
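A minimal sketch of that recursion, assuming for brevity that the size is a power of two (5000 would need padding or uneven splits):

// Recursive 2x2 block multiplication, C += A*B. Matrices are stored as
// contiguous row-major 1D arrays with leading dimension ld, so each
// sub-block is just a pointer offset into its parent.
void multiplyRec(int n, int ld, const int *A, const int *B, int *C) {
    if (n <= 64) { // base case: blocks small enough to sit in cache
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                int aik = A[i * ld + k];
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += aik * B[k * ld + j];
            }
        return;
    }
    int h = n / 2;
    const int *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    const int *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    int *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;
    // Each quadrant of C needs two recursive block products.
    multiplyRec(h, ld, A11, B11, C11); multiplyRec(h, ld, A12, B21, C11);
    multiplyRec(h, ld, A11, B12, C12); multiplyRec(h, ld, A12, B22, C12);
    multiplyRec(h, ld, A21, B11, C21); multiplyRec(h, ld, A22, B21, C21);
    multiplyRec(h, ld, A21, B12, C22); multiplyRec(h, ld, A22, B22, C22);
}

With the contiguous allocate() from the question, you would call it as multiplyRec(size, size, matA[0], matB[0], matC[0]).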
Goto, Kazushige and van de Geijn, Robert A. "Anatomy of high-performance matrix multiplication." ACM Trans. Math. Softw., Vol. 34, No. 3, pp. 1-25. New York, NY, USA: ACM, 2008.