Search code examples
cloopsparallel-processingdependenciesopenmp

Parallelisation in nested loops with OPENMP


I'm trying to parallelize to this function in C using openmp but nothing seems to work out , I'm aware there are dependencies in the iterations and the two lines also but I'm stuck, can someone help me if possible.

void loop_reference(double C[N][N], double A[N][N], double B[N][N]) {
  size_t i, j;

  for (i = 1; i < N; i++) {
    for (j = 1; j < N; j++) {
      B[i][j] = B[i-1][j] + A[i][j-1];
      A[i][j] = A[i-1][j] + C[i][j];
    }
  }
}

I've tried separating the two lines and parallelize each loop but it doesn't work.

void loop_parallel(double C[N][N], double A[N][N], double B[N][N]) {
  size_t i, j;

  #pragma omp parallel for private(i, j) shared(A, B, C)
  for (i = 1; i < N; i++) {
    for (j = 1; j < N; j++) {
      A[i][j] = A[i-1][j] + C[i][j];
      
    }
  }

  #pragma omp parallel for private(i, j) shared(A, B, C)
  for (i = 1; i < N; i++) {
    for (j = 1; j < N; j++) {
      A[i][j] = A[i-1][j] + C[i][j];
    }
  }
}

Solution

  • Your problem is that your algorithm can not be directly parallelized. Make a picture of the flow of data. Your a[0,0] is used by a[0,1] and a[1,0] so neither the outer nor the inner loop has independent iterations.

    You need to rewrite this completely by iterating over "diagonals" in i,j space. First compute a[0,0], then a[1,0], a[0,1], then a[2,0], a[1,1], a[0,2]. In other words:

    for (int d=0... 2N)
      for ( i,j such that i+j=s )
        compute a[i,j]
    

    That gives you a sequential outer, and parallel inner loop. Full parallelism is not possible.

    Of course, if this is a smoothing step in some iterative process you are actually not interested in complete reproducibility of the sequential result. In that case blithely declaring the whole loop nest parallel will correspond to "chaotic iteration" and the outer process may still converge. A little less drastic, if this is indeed a smoother, replace your lexicographic ordering by a red-black one, and you get a lot more parallelism. Again, without exactly reproducing the result of one sequential smoothing step in your original code.