
Parallel code with different output on different machines


When I run the code below on two different machines I get different output: on one machine the output is correct (sum equals sum2), on the other it is not, and I don't know why.

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(){

    const int NX=1000;    
    const int NY=1000;

    float x[NX+2];          
    float y[NX+2];          
    float u[NX+2][NY+2];    

    float x2;
    float y2;
    float sum;
    float sum2;

    for (int i=0; i<NX+2; i++){
      for (int j=0; j<NY+2; j++){
        x2      = i;
        y2      = j;
        u[i][j] = x2+ y2;
        sum += u[i][j];
      }
    }
    for (int i=0; i<NX+2; i++){
      #pragma omp parallel for
      for (int j=0; j<NY+2; j++){
        x2      = i;
        y2      = j;
        u[i][j] = x2+ y2;
      }
    }

    for (int i=0; i<NX+2;i++){
      for (int j=0; j<NY+2; j++){
        sum2 += u[i][j];
      }
    }

    printf("%f \n", sum);
    printf("%f", sum2);
}

Solution

  • You need to initialize the values of

    float sum;
    float sum2;
    

    otherwise the operations:

    sum += u[i][j];
    

    and

    sum2 += u[i][j];
    

    lead to undefined behaviour. That is why you see two different results.

    Set both variables to zero:

    float sum = 0;
    float sum2 = 0;
    

    Compile your code with (at least) the flag -Wall. Had you done that, you would have seen the following warnings:

    main.c:17:7: warning: 'sum2' may be used uninitialized in this function [-Wmaybe-uninitialized]
       17 | float sum2;
          |       ^~~~
    main.c:16:7: warning: 'sum' may be used uninitialized in this function [-Wmaybe-uninitialized]
       16 | float sum;
          |       ^~~
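
    For reference, a typical compile line (assuming GCC and that the source file is named main.c; other compilers use different flags):

    ```shell
    # -Wall enables the uninitialized-use warnings shown above;
    # -fopenmp enables the OpenMP pragmas (without it they are ignored).
    gcc -Wall -fopenmp main.c -o main
    ./main
    ```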
    

    Note also that x2 and y2 are shared among the threads of the parallel region, so the concurrent writes to them are a data race that can corrupt u as well; the rewritten loops below drop them entirely.

    Performance-wise, instead of parallelizing the inner loop:

    for (int i=0; i<NX+2; i++){
       #pragma omp parallel for
       for (int j=0; j<NY+2; j++){
          x2      = i;
          y2      = j;
          u[i][j] = x2+ y2;
        }
    }
    

    you should profile what happens when you parallelize both loops using the OpenMP collapse clause:

    #pragma omp parallel for collapse(2)
    for (int i=0; i<NX+2; i++){       
       for (int j=0; j<NY+2; j++){
          u[i][j] = i + j;
        }
    }
    

    Even if the collapse clause is not an option (e.g., it turns out to be slower), performance-wise it is still better to parallelize the outer loop rather than the inner one. First, you avoid the overhead of creating the parallel region NX+2 times. Second, since C arrays are row-major, the inner loop traverses the contiguous elements of a row; dividing the outer-loop iterations among threads gives each thread whole rows and reduces the likelihood of false sharing.
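
    A rough way to compare the two placements is to time them; this is only a sketch (actual numbers depend on the machine, thread count, and compiler flags), and the clock()-based fallback is an assumption so it also builds without OpenMP:

    ```c
    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #else
    #include <time.h>
    /* Fallback timer so the sketch also builds without -fopenmp. */
    static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
    #endif

    enum { NX = 1000, NY = 1000 };

    static float u[NX + 2][NY + 2];  /* file scope keeps the ~4 MB array off the stack */

    int main(void) {
        /* Variant 1: parallel region created NX+2 times (once per row). */
        double t0 = omp_get_wtime();
        for (int i = 0; i < NX + 2; i++) {
            #pragma omp parallel for
            for (int j = 0; j < NY + 2; j++) {
                u[i][j] = i + j;
            }
        }
        double t1 = omp_get_wtime();

        /* Variant 2: one parallel region; each thread fills whole rows,
           which are contiguous in memory (C is row-major). */
        #pragma omp parallel for
        for (int i = 0; i < NX + 2; i++) {
            for (int j = 0; j < NY + 2; j++) {
                u[i][j] = i + j;
            }
        }
        double t2 = omp_get_wtime();

        printf("inner-loop parallel: %f s\n", t1 - t0);
        printf("outer-loop parallel: %f s\n", t2 - t1);
        return 0;
    }
    ```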

    Moreover, you can also parallelize the other two loops. However, you will need the OpenMP reduction clause to avoid a race condition in the updates of the sum and sum2 variables.
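
    As a standalone sketch of the reduction clause, here is a minimal example with an integer sum, chosen so the result is exact and deterministic:

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        long sum = 0;
        /* Each thread accumulates into a private copy of sum initialized to 0;
           OpenMP combines the partial sums when the loop ends, so there is no race. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++) {
            sum += i;
        }
        /* Integer arithmetic is exact: 0 + 1 + ... + 999 = 499500. */
        assert(sum == 499500L);
        printf("%ld\n", sum);  /* prints 499500 */
        return 0;
    }
    ```

    An integer is used here because with float sums of this magnitude (around 10^9, beyond float's 24-bit mantissa) round-off and the non-deterministic order of a parallel reduction can make the printed value vary slightly between runs.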

    The final code would look like the following:

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>
    
    int main(){
    
       const int NX=1000;    
       const int NY=1000;
       
       float u[NX+2][NY+2];    
       float sum = 0;
       float sum2 = 0;
    
       #pragma omp parallel for reduction(+:sum)
       for (int i=0; i<NX+2; i++){
         for (int j=0; j<NY+2; j++){
           sum += i+j;
         }
       }
       #pragma omp parallel for
       for (int i=0; i<NX+2; i++){
         for (int j=0; j<NY+2; j++){
            u[i][j] = i+j;
         }
       }
    
       #pragma omp parallel for reduction(+:sum2)
       for (int i=0; i<NX+2;i++){
         for (int j=0; j<NY+2; j++){
           sum2 += u[i][j];
         }
       }
    
       printf("%f \n", sum);
       printf("%f", sum2);
    }