Search code examples
loopsnestedfortranopenmp

How to fully parallelise sequential inner loops in Fortran with OpenMP


I'm attempting to parallelise some old fixed format Fortran code with OpenMP. I can't figure out how to fully parallelise the following structure of nested loops, with one outer loop and 2 sequential inner loops:

      do y = 1,ny
        do x = 1,nx
          calculation 1
        enddo
        intermediate calculation (calculation 1)
        do x = 1,nx
          calculation 2 (intermediate calculation)
        enddo
      enddo

calculation 2 is such that it cannot be included in the first inner loop, and must be included in the outer loop as opposed to in a separate outer loop. This is due to a dependency on an intermediate calculation, itself dependent on all values of calculation 1

I am compiling with gfortran, and setting the environment variable OMP_NUM_THREADS=4. The following code illustrates the 3 approaches I've tested:

c     Test of using OpenMP to parallelise sequential inner loops and an
c       outer loop
c     Use export OMP_NUM_THREADS=4
      program parallelTest

c     declarations
      implicit none
      integer nx, ny
      parameter (nx = 2, ny = 5)
      integer omp_get_thread_num, i, j, A(nx,ny), B(nx,ny), C(nx,ny),
     >    D(nx,ny), E(nx,ny), F(nx,ny)

c     initialisation
      A = 7
      B = 7
      C = 7
      D = 7
      E = 7
      F = 7

c     attempt 1: just executes first loop, 2nd loop is ignored
c$omp parallel do shared(A,B) private(i,j) schedule(static) collapse(2)
      do j = 1,ny
        do i = 1,nx
          A(i,j) = omp_get_thread_num()
        enddo
        do i = 1,nx
          B(i,j) = omp_get_thread_num()
        enddo
      enddo
c$omp end parallel do

c     attempt 2: only parallelises outer loop
c$omp parallel do shared(C,D) private(i,j) schedule(static)
      do j = 1,ny
        do i = 1,nx
          C(i,j) = omp_get_thread_num()
        enddo
        do i = 1,nx
          D(i,j) = omp_get_thread_num()
        enddo
      enddo
c$omp end parallel do

c     attempt 3: only parallelises inner loops
c$omp parallel shared(E,F) private(i,j)
      do j = 1,ny
c$omp   do schedule(static)
        do i = 1,nx
          E(i,j) = omp_get_thread_num()
        enddo
c$omp   end do
c$omp   do schedule(static)
        do i = 1,nx
          F(i,j) = omp_get_thread_num()
        enddo
c$omp   end do
      enddo
c$omp end parallel

c     print output to terminal
      do i = 1,nx
        print *, 'A(', i, ',:) = ', A(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'B(', i, ',:) = ', B(i,:)
      enddo
      print *
      print *
      do i = 1,nx
        print *, 'C(', i, ',:) = ', C(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'D(', i, ',:) = ', D(i,:)
      enddo
      print *
      print *
      do i = 1,nx
        print *, 'E(', i, ',:) = ', E(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'F(', i, ',:) = ', F(i,:)
      enddo

      end

This gives the following output:

 A(           1 ,:) =            0           0           1           2           3
 A(           2 ,:) =            0           1           1           2           3

 B(           1 ,:) =            7           7           7           7           7
 B(           2 ,:) =            7           7           7           7           7


 C(           1 ,:) =            0           0           1           2           3
 C(           2 ,:) =            0           0           1           2           3

 D(           1 ,:) =            0           0           1           2           3
 D(           2 ,:) =            0           0           1           2           3


 E(           1 ,:) =            0           0           0           0           0
 E(           2 ,:) =            1           1           1           1           1

 F(           1 ,:) =            0           0           0           0           0
 F(           2 ,:) =            1           1           1           1           1

In approach 1, the outer loop and first inner loop are parallelised as I expect, but the 2nd loop doesn't execute (presumably because the do doesn't immediately follow a c$omp do?). In approach 2, only the outer loop is parallelised. In approach 3, only the inner loops are parallelised.

My question is: how do I get matrix B to be the same as matrix A? This seems like it should be a straightforward task; I'm assuming there's an OpenMP clause or structure I'm not utilising, but would much appreciate some pointing in the right direction of what to search for (new to OpenMP!).


Solution

  • If the information flow is as you have described it, I would re-write your loops as

    do y = 1,ny
      do x = 1,nx
        calculation 1
      enddo
    enddo
    
    do y = 1,ny
      intermediate calculation (calculation 1)
    enddo
    
    do y = 1,ny
      do x = 1,nx
        calculation 2 (intermediate calculation)
      enddo
    enddo
    

    And then parallelise each y loop separately.