I want to speed up the matrix multiplication below. Here is the code that I need to improve:
void multiply(int size, int **matA, int **matB, int **matC) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            int t = matC[i][j];
            for (int k = 0; k < size; k++) {
                t += matA[i][k] * matB[k][j];
            }
            matC[i][j] = t; // t already includes the old matC[i][j]
        }
    }
}
I have two input matrices and a result matrix, each 5000x5000. At that size the matrices cannot fit in cache entirely, so does the loop above cause many cache misses (or even page faults)? I want to know how to speed up the multiplication, and how to organize the data: should I use a 1D array or a 2D array?
My attempt is listed below. I chose a 1D array to simulate the 2D array (one new[] per matrix), but I am not sure whether that is faster than a true 2D array. I use a temporary matrix to hold the transpose of matB so that the inner loop reads both operands sequentially, which should avoid cache misses and page faults, and I added AVX2 to get more performance. Which is better at 5000x5000, a 1D array or a 2D array? Any other ideas or tips?
// Allocate a rows x cols matrix as one contiguous buffer plus row pointers.
int **allocate(int rows, int cols) {
    int **mat = new int *[rows];
    int *temp = new int[rows * cols];
    for (int i = 0; i < rows; i++) {
        mat[i] = temp + i * cols;
    }
    return mat;
}
void multiply(int size, int **matA, int **matB, int **matC) {
    int n = size * size; // total number of elements
    // Store matB transposed (column-major order) so the inner product
    // reads both operands sequentially.
    int **transMatB = allocate(size, size);
    // Copy the transposed data; maybe many cache misses here.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        transMatB[i / size][i % size] = matB[i % size][i / size];
    }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int *row = matA[i / size];
        int *col = transMatB[i % size];
        int temp;
#ifdef __AVX2__
        temp = multiplyAndSumArrays(row, col, size);
#else
        temp = 0;
        for (int k = 0; k < size; k++) {
            temp += row[k] * col[k];
        }
#endif
        matC[i / size][i % size] += temp;
    }
    // Free the temporary transposed matrix.
    delete[] transMatB[0];
    delete[] transMatB;
    transMatB = nullptr;
}
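For reference, multiplyAndSumArrays looks roughly like this (a minimal sketch; my real version may differ in the horizontal-sum details):

#include <immintrin.h>

// Dot product of two int arrays using AVX2: eight 32-bit products are
// accumulated per iteration, with a scalar tail for leftover elements.
static int multiplyAndSumArrays(const int *a, const int *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    int k = 0;
    for (; k + 8 <= n; k += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + k));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + k));
        acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(va, vb));
    }
    // Horizontal sum of the eight 32-bit lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    int result = _mm_cvtsi128_si32(s);
    for (; k < n; k++) result += a[k] * b[k]; // scalar tail
    return result;
}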
Matrix-matrix multiplication is probably the most studied kernel when it comes to optimization. For the ultimate treatment, read the paper by Goto and van de Geijn, cited below.
The crux is that

- matrix multiplication performs O(n^3) operations on only O(n^2) data, so there is great potential for data reuse;
- at 5000x5000 none of the matrices fits in cache, so a naive loop ordering keeps streaming data from main memory;
- therefore each loop needs to be blocked (tiled) so that the working set fits in cache.

Specifically to that last point: in a nutshell, each of the 3 loops needs to be split into two loops, one over blocks and one inside a block. You then have 6 loops (on the order of 6! possible loop orderings) and the 3 block sizes as tuning parameters. The above paper analyzes this completely.
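To make that concrete, here is a minimal sketch of the blocked version of the triple loop, with a single block size BS for all three dimensions (the Goto algorithm tunes all three independently):

#include <algorithm>

// Blocked (tiled) multiplication, C += A*B: each loop is split into a
// loop over blocks and a loop within a block. BS is a tuning parameter;
// pick it so that three BS x BS tiles fit in cache together.
const int BS = 64;

void multiplyBlocked(int size, int **matA, int **matB, int **matC) {
    for (int ii = 0; ii < size; ii += BS)
        for (int kk = 0; kk < size; kk += BS)
            for (int jj = 0; jj < size; jj += BS)
                // Accumulate the (ii,jj) tile of C from the (ii,kk)
                // tile of A and the (kk,jj) tile of B.
                for (int i = ii; i < std::min(ii + BS, size); i++)
                    for (int k = kk; k < std::min(kk + BS, size); k++) {
                        int aik = matA[i][k];
                        for (int j = jj; j < std::min(jj + BS, size); j++)
                            matC[i][j] += aik * matB[k][j];
                    }
}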
Note that this is not simple! For a reasonably doable solution, do a recursive 2x2 multiplication: divide each matrix into a 2x2 block structure, and multiply those blocks recursively. You stop the recursion when the blocks are small enough to fit in cache.
That should be doable as a class assignment and gives a nice performance improvement (though not as high as the Goto algorithm!). You can even multi-thread it simply.
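A minimal sketch of that recursion, assuming for brevity that the size is a power of two (5000 would need padding or uneven splits):

// Recursive 2x2 block multiplication, C += A*B. Matrices are stored as
// contiguous row-major 1D arrays with leading dimension ld, so each
// sub-block is just a pointer offset into its parent.
void multiplyRec(int n, int ld, const int *A, const int *B, int *C) {
    if (n <= 64) { // base case: blocks small enough to sit in cache
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                int aik = A[i * ld + k];
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += aik * B[k * ld + j];
            }
        return;
    }
    int h = n / 2;
    const int *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
    const int *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
    int *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;
    // Each quadrant of C needs two recursive block products.
    multiplyRec(h, ld, A11, B11, C11); multiplyRec(h, ld, A12, B21, C11);
    multiplyRec(h, ld, A11, B12, C12); multiplyRec(h, ld, A12, B22, C12);
    multiplyRec(h, ld, A21, B11, C21); multiplyRec(h, ld, A22, B21, C21);
    multiplyRec(h, ld, A21, B12, C22); multiplyRec(h, ld, A22, B22, C22);
}

With the contiguous allocate() from the question, you would call it as multiplyRec(size, size, matA[0], matB[0], matC[0]).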
Goto, Kazushige and van de Geijn, Robert A. "Anatomy of high-performance matrix multiplication." ACM Trans. Math. Softw., Vol. 34, No. 3, pp. 1-25. New York, NY, USA: ACM, 2008.