I wrote serial and parallel versions of matrix multiplication. The serial code takes about 4 seconds, but when I run the parallel code with, for example, 4 threads, the measured time comes out above 20 seconds, and it grows every time I increase the number of threads. I want to know what is wrong. Here is the OpenMP code:
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int r1, c1;
    int r2, c2;
    int i, j, k;
    int **mat1;
    int **mat2;
    int **result;
    srand(time(0));
    double time_spent = 0;

    printf("Enter dimensions of the first matrix: \n");
    scanf("%d%d", &r1, &c1);
    mat1 = (int **)malloc(r1 * sizeof(int*));
    for (i = 0; i < r1; i++)
        mat1[i] = (int *)malloc(c1 * sizeof(int));
    for (i = 0; i < r1; i++)
        for (j = 0; j < c1; j++)
            mat1[i][j] = (rand() % (10 - 1 + 1)) + 1;   /* random value in 1..10 */

    printf("Enter dimensions of the second matrix: \n");
    scanf("%d%d", &r2, &c2);
    mat2 = (int **)malloc(r2 * sizeof(int*));
    for (i = 0; i < r2; i++)
        mat2[i] = (int *)malloc(c2 * sizeof(int));
    for (i = 0; i < r2; i++)
        for (j = 0; j < c2; j++)
            mat2[i][j] = (rand() % (10 - 1 + 1)) + 1;

    result = (int **)malloc(r1 * sizeof(int*));
    for (i = 0; i < r1; i++)
        result[i] = (int *)calloc(c2, sizeof(int));     /* calloc, not malloc: rows must start at zero, the loop below uses += */

    #pragma omp parallel private(i, j, k) shared(mat1, mat2, result)
    {
        clock_t begin = clock();
        #pragma omp for schedule(static)
        for (i = 0; i < r1; i++) {
            for (j = 0; j < c2; j++) {
                for (k = 0; k < r2; k++) {
                    result[i][j] += mat1[i][k] * mat2[k][j];
                }
            }
        }
        clock_t end = clock();
        time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    }
    printf("Time elapsed: %f\n", time_spent);
    printf("\n");
    /*
    for (i = 0; i < r1; i++) {
        for (j = 0; j < c2; j++) {
            printf("%d ", result[i][j]);
        }
        printf("\n");
    }
    */
    for (i = 0; i < r1; i++)
        free(mat1[i]);
    free(mat1);
    for (i = 0; i < r2; i++)
        free(mat2[i]);
    free(mat2);
    for (i = 0; i < r1; i++)
        free(result[i]);
    free(result);
}
Q : "Proper way to compute time in openmp"
OpenMP has nothing to do with this. clock() does not measure wall-clock time: it reports the CPU-ticks (accumulated in low-level per-core hardware counters) summed over all threads of the process, and those threads run at the same time. With 4 busy threads, the summed CPU time is roughly 4x the elapsed wall time plus parallelization overhead, so the reported figure grows with every thread you add. (The time_spent += update inside the parallel region is also an unsynchronized write shared by all threads.) Compare that to the OpenMP native tool: double omp_get_wtime(void);
You may like to experiment with further run-time options, to see the improvement impacts of better, more cache-efficient RAM-I/O, other OpenMP scheduling policies, thread-capacities, sharing-avoidance and other options of choice.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
// Omp-out-most-BLOCK
// Time elapsed: 0.004587 [ 123 x 123] -O3
// Time elapsed: 0.029240 [ 123 x 123]
// Time elapsed: 5.266409 [1234 x 1234] -O3 [ i, j, k ] ~ User time: 2.729 s Real time: 1.481 s
// Time elapsed: 46.048513 [1234 x 1234]
// Time elapsed: 73.393222 [1234 x 1234] -O3 [ j, k, i ]
// Time elapsed: 86.988589 [1234 x 1234] -O3 [ k, j, i ] ~ User time: 43.613 s Real time: 22.411 s
// w/o Omp
// a pure-[SERIAL] Time elapsed: 0.001580 [ 123 x 123] -O3
// Time elapsed: 0.010290 [ 123 x 123]
// Time elapsed: 4.075591 [1234 x 1234] -O3 [ i, j, k ] ~ User time: 4.209 s Real time: 4.296 s
// Time elapsed: 23.437123 [1234 x 1234] [ i, j, k ] ~ User time: 23.520 s Real time: 23.716 s
// Time elapsed: 42.685109 [1234 x 1234] [ k, j, i ] ~ User time: 42.757 s Real time: 43.187 s
//