When I run the code below on two different machines I get different output: on one the output is correct (sum == sum2), on the other it is not, and I don't know why.
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(){
    const int NX=1000;
    const int NY=1000;
    float x[NX+2];
    float y[NX+2];
    float u[NX+2][NY+2];
    float x2;
    float y2;
    float sum;
    float sum2;
    for (int i=0; i<NX+2; i++){
        for (int j=0; j<NY+2; j++){
            x2 = i;
            y2 = j;
            u[i][j] = x2 + y2;
            sum += u[i][j];
        }
    }
    for (int i=0; i<NX+2; i++){
        #pragma omp parallel for
        for (int j=0; j<NY+2; j++){
            x2 = i;
            y2 = j;
            u[i][j] = x2 + y2;
        }
    }
    for (int i=0; i<NX+2; i++){
        for (int j=0; j<NY+2; j++){
            sum2 += u[i][j];
        }
    }
    printf("%f \n", sum);
    printf("%f", sum2);
}
You need to initialize the variables
float sum;
float sum2;
otherwise the operations
sum += u[i][j];
and
sum2 += u[i][j];
read indeterminate values, which is undefined behaviour. That is why you see two different results.
Set both variables to zero:
float sum = 0;
float sum2 = 0;
Compile your code with (at least) the flag -Wall. Had you done that, you would have seen the following warnings:
main.c:17:7: warning: 'sum2' may be used uninitialized in this function [-Wmaybe-uninitialized]
17 | float sum2;
| ^~~~
main.c:16:7: warning: 'sum' may be used uninitialized in this function [-Wmaybe-uninitialized]
16 | float sum;
| ^~~
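For reference, a typical invocation would look like the following (assuming GCC and the file name main.c from the warnings above; adjust to your compiler):

gcc -Wall -fopenmp main.c -o main

The -fopenmp flag is what enables the OpenMP pragmas; without it they are not compiled in and the code runs serially.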
Performance-wise, instead of parallelizing the inner loop:
for (int i=0; i<NX+2; i++){
    #pragma omp parallel for
    for (int j=0; j<NY+2; j++){
        x2 = i;
        y2 = j;
        u[i][j] = x2 + y2;
    }
}
you should profile what happens when you parallelize both loops using the OpenMP collapse clause:
#pragma omp parallel for collapse(2)
for (int i=0; i<NX+2; i++){
    for (int j=0; j<NY+2; j++){
        u[i][j] = i + j;
    }
}
Even if the collapse clause turns out not to be an option (e.g., it is slower), performance-wise it would still be better to parallelize the outer loop rather than the inner loop. First, you avoid the overhead of creating a parallel region NX+2 times. Second, since u is stored row by row and the inner loop walks the contiguous elements of a row, dividing the iterations of the outer loop among threads gives each thread its own rows and reduces the likelihood of false sharing.
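Concretely, a minimal sketch of that loop nest with only the outer loop parallelized (same computation, pragma moved up one level) would be:

#pragma omp parallel for
for (int i=0; i<NX+2; i++){     // rows are divided among threads once
    for (int j=0; j<NY+2; j++){ // each thread then walks a contiguous row
        u[i][j] = i + j;
    }
}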
Moreover, you can also parallelize the other two loops. However, you will need the OpenMP reduction clause to avoid a race condition during the updates of the sum and sum2 variables. (Note that your current parallel loop already has a race of its own: the shared variables x2 and y2 are written by every thread. The code below sidesteps this by computing i + j directly.)
The final code would look like the following:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(){
    const int NX=1000;
    const int NY=1000;
    float u[NX+2][NY+2];
    float sum = 0;
    float sum2 = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i=0; i<NX+2; i++){
        for (int j=0; j<NY+2; j++){
            sum += i+j;
        }
    }
    #pragma omp parallel for
    for (int i=0; i<NX+2; i++){
        for (int j=0; j<NY+2; j++){
            u[i][j] = i+j;
        }
    }
    #pragma omp parallel for reduction(+:sum2)
    for (int i=0; i<NX+2; i++){
        for (int j=0; j<NY+2; j++){
            sum2 += u[i][j];
        }
    }
    printf("%f \n", sum);
    printf("%f", sum2);
}
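As a final usage note, the number of threads is controlled at run time via the standard OMP_NUM_THREADS environment variable, e.g. (assuming the binary name main from the compile command above):

OMP_NUM_THREADS=4 ./main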