I am trying to implement matrix multiplication with multiple threads. Everything seems to work correctly, however, it work much slower than the usual algorithm. Here is my code
public class Main {
private static int nRows = 500; //number of rows and columns in matrices
private static int[][] matrix1 = new int[nRows][nRows]; //first matrix for multiplication
private static int[][] matrix2 = new int[nRows][nRows]; //second matrix for multiplication
private static int[][] result1 = new int[nRows][nRows]; //result from linear matrix multiplication
private static int[][] result2 = new int[nRows][nRows]; //result from parallel matrix multiplication
private static Thread[][] pool = new Thread[nRows][nRows]; //array of threads
//method used for transposing a matrix to get its column easily
public static int[][] transpose(int[][] matrix) {
int[][] newMatrix = new int[matrix[0].length][matrix.length];
for (int i = 0; i < matrix[0].length; i++) {
for (int j = 0; j < matrix.length; j++) {
newMatrix[i][j] = matrix[j][i];
}
}
return newMatrix;
}
public static void main(String[] args) {
//initializing input matrices (setting all elements = 1)
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
matrix1[i][j] = 1;
matrix2[i][j] = 1;
}
}
long start;
long end;
System.out.println("Linear algorithm");
start = System.currentTimeMillis();
//linear multiplication algorithm
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
int temp = 0;
for (int k = 0; k < nRows; k++) {
temp += matrix1[i][k] * matrix2[k][j];
}
result1[i][j] = temp;
}
}
//show result
// for(int i=0;i<nRows;i++){
// for(int j=0;j<nRows;j++){
// System.out.print(result1[i][j] + " ");
// }
// System.out.println();
// }
end = System.currentTimeMillis();
System.out.println("Time with linear algorithm: " + (end - start));
//--------------------
System.out.println("Parallel algorithm");
start = System.currentTimeMillis();
int[][] matrix3 = transpose(matrix2); //get a transpose copy of second matrix
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
pool[i][j] = new myThread(matrix1[i], matrix3[j], i, j); //creating a thread for each element
pool[i][j].start(); //starting a thread
}
}
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
try {
pool[i][j].join(); //waiting for the thread to finish its job
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
//show the result
// for(int i=0;i<nRows;i++){
// for(int j=0;j<nRows;j++){
// System.out.print(result2[i][j] + " ");
// }
// System.out.println();
// }
end = System.currentTimeMillis();
System.out.println("Time with parallel algorithm: " + (end - start));
}
//class, where parallel multiplication is implemented
private static class myThread extends Thread {
private int[] row = new int[nRows]; //row for multiplication
private int[] col = new int[nRows]; //column for multiplication
private int i; //row index of the element in resulting matrix
private int j; //column index of the element in resulting matrix
//constructor
public myThread(int[] r, int[] c, int i, int j) {
row = r;
col = c;
this.i = i;
this.j = j;
}
public void run() {
int temp = 0;
for (int k = 0; k < nRows; k++) {
temp += row[k] * col[k]; //getting the element by multiplying row and column of two matrices
}
result2[i][j] = temp; //writing the resulting element to the resulting matrix
}
}
}
Here, I create a new thread for each element in the resulting matrix. I than write these threads to an array, start them and, finally, wait for them to finish working. I've seen some realizations, where the whole input matrix (both of them) would be given as parameters to the thread. My task is, however, to come up with an algorithm, where only one row and one column (that are necessary for this particular element) are given.
After measuring the time elapsed I get following results
Linear algorithm
Time with linear algorithm: 557
Parallel algorithm
Time with parallel algorithm: 38262
What am I doing wrong? Thanks in advance!
The code you've written will work fine on a GPU where the concept of threads is very different and the overhead is basically zero. On CPU-based systems, spawning threads is an exceptionally slow operation and it only makes sense if you can amortise this overhead over a lot of computational work.
Here is some general advice that will help you write better parallel algorithms for CPUs:
result2
fall in the same cache line (cache lines on x86 and ARM are 64 bytes long, int
is 4 bytes) and are written by 16 different threads. The use of a temporary summation variable alleviates this problem somehow--it is much more severe when false sharing happens repeatedly in an inner(-most) loop.In summary, start as many threads as physical cores and have them work on big contiguous chunks of the input matrices.