I'm trying to multiply two n by n matrices using OpenMP in C, but my code only calculates the diagonal elements correctly. Here is the code:
#include <stdio.h>
#include <omp.h>

int main(void){
    int N, element, tid;
    int i, j;
    scanf("%d", &N); // matrix dimension
    int A[N][N]; // matrix A
    int B[N][N]; // matrix B
    int C[N][N]; // matrix C := AB
    // read elements of matrix A
    for(i = 0; i < N; i++){
        for(j = 0; j < N; j++){
            scanf("%d", &element);
            A[i][j] = element;
        }
    }
    // read elements of matrix B
    for(i = 0; i < N; i++){
        for(j = 0; j < N; j++){
            scanf("%d", &element);
            B[i][j] = element;
        }
    }
    omp_set_num_threads(N); // set the number of threads
    #pragma omp parallel default(none) shared(A, B, C, N) private(i, j, tid)
    {
        tid = omp_get_thread_num(); // id of the currently executing thread within the team
        #pragma omp for
        for (i = 0; i < N; i++){
            C[tid][i] = 0;
            for (j = 0; j < N; j++){
                //tid = i*n + j;
                C[tid][i] += A[tid][j]*B[j][i];
            }
        }
    }
    // print matrix C
    for(i = 0; i < N; i++){
        for(j = 0; j < N; j++){
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }
    return 0;
}
This is what I get with N = 2 and matrices A = {{1, 2}, {3, 4}}, B = {{1, 2}, {3, 4}}:
C = {{7, random number}, {random number, 22}}, instead of the expected C = {{7, 10}, {15, 22}}.
Let's look at how your code behaves:
#pragma omp parallel default(none) shared(A, B, C, N) private(i, j, tid)
This creates N threads (since you requested N with omp_set_num_threads), all of which execute the following block.
#pragma omp for
for (i = 0; i < N; i++){
This distributes the N iterations among the N threads. With the default OpenMP scheduling, the first thread (tid=0) executes only the first iteration (i=0), the second thread (tid=1) only the second iteration (i=1), and so on. This means that in the loop body you always have i == tid, so C[tid][i] addresses the diagonal elements only. That is exactly what your N = 2 run shows: only C[0][0] and C[1][1] are computed, while the off-diagonal entries remain uninitialized.
You should obtain the desired result by suppressing #pragma omp for, so that the loop on i is fully executed by each thread: thread tid then computes the whole row tid of C. But as mentioned in the comments, this approach of creating N threads will perform poorly as soon as N is larger than the number of cores (and won't use all the cores when N is smaller than the number of cores).
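Concretely, that suggestion amounts to something like the following sketch, reusing the variables of your program (and assuming the runtime actually grants the N threads you requested):

omp_set_num_threads(N); // request one thread per row
#pragma omp parallel default(none) shared(A, B, C, N) private(i, j, tid)
{
    tid = omp_get_thread_num(); // each thread owns row tid of C
    // no "omp for" here: every thread runs the full loop over i,
    // but each one writes only its own row, so there is no race
    for (i = 0; i < N; i++){
        C[tid][i] = 0;
        for (j = 0; j < N; j++){
            C[tid][i] += A[tid][j]*B[j][i];
        }
    }
}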
EDIT
The right way is instead to keep the default number of threads (which is usually equal to the number of physical cores) and to define a parallel loop that distributes the rows to the threads:
#pragma omp parallel default(none) shared(A, B, C, N)
{
    #pragma omp for
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++){
            C[k][i] = 0;
            for (int j = 0; j < N; j++){
                C[k][i] += A[k][j]*B[j][i];
            }
        }
    }
}
If N = 15 for instance, and with 4 threads, the default (typically static) schedule gives each of the first three threads 4 consecutive rows and leaves the remaining 3 rows to the last thread.
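To check the distribution on your own machine, a small self-contained diagnostic (hypothetical, not part of the fix) can print which thread handles which row:

#include <stdio.h>
#include <omp.h>

int main(void){
    int N = 15;
    #pragma omp parallel for default(none) shared(N)
    for (int k = 0; k < N; k++){
        printf("row %2d handled by thread %d\n", k, omp_get_thread_num());
    }
    return 0;
}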
But note that this code is likely not efficient, even when executed serially without OpenMP. The memory access pattern on B in the inner loop is poor, as it addresses non-contiguous elements in memory. For small N this doesn't matter, but for large N it can cause many cache misses and thus degraded performance.
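One common remedy, not part of the original answer, is to interchange the two inner loops so that the innermost index i walks contiguous elements of both B and C. A sketch under that assumption:

#pragma omp parallel default(none) shared(A, B, C, N)
{
    #pragma omp for
    for (int k = 0; k < N; k++){
        for (int i = 0; i < N; i++){
            C[k][i] = 0; // initialize the whole row first
        }
        for (int j = 0; j < N; j++){
            for (int i = 0; i < N; i++){
                // A[k][j] is constant in the inner loop;
                // B[j][i] and C[k][i] are now traversed contiguously
                C[k][i] += A[k][j]*B[j][i];
            }
        }
    }
}

The result is identical; only the traversal order of B changes, which tends to reduce cache misses for large N.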