fortran · openmp · nested-loops

Parallelizing a nested Fortran loop


I used OpenMP with the COLLAPSE clause to parallelise a nested Fortran loop. The code compiles successfully, but it gives me the following error message when I run it.

Loading compiler-rt version 2022.0.2
Loading oclfpga version 2022.0.2
  Load "debugger" to debug DPC++ applications with the gdb-oneapi debugger.
  Load "dpl" for additional DPC++ APIs: https://github.com/oneapi-src/oneDPL
Loading mkl version 2022.0.2
Loading mpi version 2021.5.1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 15005 RUNNING AT login
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

[1]+  Done                    ./run.sh

Here is the nested part of my code:

PROGRAM CODE
USE MPI
IMPLICIT NONE
INTEGER, PARAMETER              ::             dp = SELECTED_REAL_KIND(15,14)
INTEGER                         ::             ra(2)                         
REAL (KIND=dp)                  ::             ch(3)                         
REAL (KIND=dp)                  ::             it                            
REAL (KIND=dp)                  ::             mi                            
INTEGER                         ::             nm                            
REAL (KIND=dp)                  ::             te                            
INTEGER                         ::             km(3)                         
INTEGER                         ::             ps(3)                         
REAL (KIND=dp)                  ::             lr(3,3)                       
INTEGER                         ::             na                            
REAL (KIND=dp), ALLOCATABLE     ::             ap(:,:)                       
INTEGER, ALLOCATABLE            ::             ns(:,:)                       
INTEGER                         ::             po(3)                         
INTEGER                         ::             nb                            
INTEGER, ALLOCATABLE            ::             tv(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             ok(:,:,:)                     
COMPLEX (KIND=dp), ALLOCATABLE  ::             oo(:,:,:)                     
INTEGER                         ::             lwork, info
INTEGER, ALLOCATABLE            ::             ipiv(:)
COMPLEX (KIND=dp), ALLOCATABLE  ::             work(:)
INTEGER                         ::             i, j, k, l, m                 
REAL (KIND=dp), ALLOCATABLE     ::             kp(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             ks(:,:,:)                     
COMPLEX (KIND=dp), ALLOCATABLE  ::             ol(:,:,:)                     
REAL (KIND=dp), ALLOCATABLE     ::             ei(:)                         
COMPLEX (KIND=dp), ALLOCATABLE  ::             th(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             ll(:,:,:)                     
COMPLEX (KIND=dp), ALLOCATABLE  ::             sr(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             rl(:,:,:)                     
COMPLEX (KIND=dp), ALLOCATABLE  ::             ls(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             rs(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             rg(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             ag(:,:)                       
COMPLEX (KIND=dp), ALLOCATABLE  ::             tr(:,:)                       
COMPLEX (KIND=dp)               ::             ic                            
COMPLEX (KIND=dp)               ::             pf                            
REAL (KIND=dp)                  ::             fd(3)                         
COMPLEX (KIND=dp), ALLOCATABLE  ::             t1(:,:,:,:)                   
COMPLEX (KIND=dp), ALLOCATABLE  ::             t2(:)                         
COMPLEX (KIND=dp), ALLOCATABLE  ::             t3(:)                         
INTEGER                         ::             omp_get_num_threads           
INTEGER                         ::             omp_get_thread_num            
INTEGER                         ::             num_thread                    
INTEGER                         ::             id_thread                     
k = ra(2) - ra(1)
ALLOCATE (ll(5,k,k))
ALLOCATE (sr(k,k))
ALLOCATE (rl(5,k,k))
ALLOCATE (ls(k,k))
ALLOCATE (rs(k,k))
ALLOCATE (rg(k,k))
ALLOCATE (ag(k,k))
ALLOCATE (tr(k,k))
ALLOCATE (ipiv(k))
ALLOCATE (work(2*k-1))
ALLOCATE (th(nb,nb))
lwork = 2*k-1
!$OMP PARALLEL DEFAULT(NONE) SHARED(num_thread)
num_thread = omp_get_num_threads()
!$OMP END PARALLEL
ALLOCATE (t1(nm,km(1)*km(2)*km(3),k,k))
t1 = DCMPLX(0.0d0, 0.0d0)
ALLOCATE (t2(nm))
t2 = DCMPLX(0.0d0, 0.0d0)
ALLOCATE (t3(nm))
t3 = DCMPLX(0.0d0, 0.0d0)

!$OMP PARALLEL DO PRIVATE(i,j,l,m,pf,th,ll,sr,rl,ls,rs,rg,ag,ipiv,work,lwork,info,fd,tr)&
!$OMP             SHARED(po,km,kp,nm,nb,ei,ic,tv,lr,ol,ks,ra,k,ch,te) COLLAPSE(2)&
!$OMP             REDUCTION(+:t1,t2,t3)
DO i = 1, nm, 1
   DO j = 1, km(1)*km(2)*km(3), 1
      DO l = 1, nb, 1
         DO m = 1, po(1)*po(2)*po(3), 1
            pf = EXP(ic*(kp(j,1)*(tv(1,m)*lr(1,1)+tv(2,m)*lr(2,1)+tv(3,m)*lr(3,1))+kp(j,2)*(tv(1,m)*lr(1,2)+tv(2,m)*lr(2,2)+tv(3,m)*lr(3,2))+kp(j,3)*(tv(1,m)*lr(1,3)+tv(2,m)*lr(2,3)+tv(3,m)*lr(3,3))))
            th(:,l) = pf*((ei(i)+ic)*ol(:,l,m)-ks(:,l,m))
         END DO
      END DO
      ll(1,:,:) = th(ra(1)-k:ra(2)-k,ra(1)-k*2:ra(2)-k*2)
      ll(2,:,:) = th(ra(1)-k:ra(2)-k,ra(1)-k:ra(2)-k)    
      ll(3,:,:) = th(ra(1)-k*2:ra(2)-k*2,ra(1)-k:ra(2)-k)
      ll(4,:,:) = th(ra(1)-k:ra(2)-k,ra(1):ra(2))        
      ll(5,:,:) = th(ra(1):ra(2),ra(1)-k:ra(2)-k)        
      sr = th(ra(1):ra(2),ra(1):ra(2))                   
      rl(1,:,:) = th(ra(1)+k:ra(2)+k,ra(1)+k*2:ra(2)+k*2)
      rl(2,:,:) = th(ra(1)+k:ra(2)+k,ra(1)+k:ra(2)+k)    
      rl(3,:,:) = th(ra(1)+k*2:ra(2)+k*2,ra(1)+k:ra(2)+k)
      rl(4,:,:) = th(ra(1):ra(2),ra(1)+k:ra(2)+k)        
      rl(5,:,:) = th(ra(1)+k:ra(2)+k,ra(1):ra(2))        
      ls = MATMUL(MATMUL(ll(1,:,:),ll(2,:,:)),ll(3,:,:)) 
      ll(2,:,:) = ll(2,:,:) - ls                         
      rs = MATMUL(MATMUL(rl(1,:,:),rl(2,:,:)),rl(3,:,:)) 
      rl(2,:,:) = rl(2,:,:) - rs                         
      rg = DCMPLX(0.0d0, 0.0d0)
      rg = sr-ls-rs
      CALL ZGETRF(ra(2)-ra(1),ra(2)-ra(1),rg,l,ipiv,info)
      CALL ZGETRI(ra(2)-ra(1),rg,ra(2)-ra(1),ipiv,work,lwork,info)
      DO l = 1, k, 1
         ag(l,:) = DCONJG(rg(:,l))
      END DO
      fd(1) = 1.0d0/(EXP((ei(i)-ch(1))/te)+1.0d0)
      fd(2) = 1.0d0/(EXP((ei(i)-ch(3))/te)+1.0d0)
      IF (fd(1) > fd(2)) THEN
         fd(3) = fd(1)
         fd(1) = fd(2)
         fd(2) = fd(3)
         fd(3) = fd(2)-fd(1)
      ELSE IF (fd(1) == fd(2)) THEN
              fd(3) = 1.0d0
      ELSE
         fd(3) = fd(2)-fd(1)
      END IF
      t1(i,j,:,:) = t1(i,j,:,:)+fd(3)*MATMUL(MATMUL(rg,ls+rs),ag)/2.0d0-fd(1)*IMAG(rg)
      tr = MATMUL(MATMUL(MATMUL(ls,ag),rs),rg)
      DO l = 1, k, 1
         t2(i) = t2(i)+tr(l,l)
      END DO
      t2(i) = t2(i)*fd(3)
      tr = rg - ag
      DO l = 1, k, 1
         t3(i) = t3(i)+tr(l,l)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO

100 STOP
END PROGRAM CODE

I am not sure whether this problem comes from limited memory on my computer, because it persists even after I reduce the computational load (smaller values for the 'nm' and 'km' variables in the nested Fortran loop). Would anyone please give me some suggestions on how to resolve this problem?

I have added a small, complete example below with the same structure as the code above. Would anyone please have a look at it and give me some suggestions, particularly about the COLLAPSE and REDUCTION clauses of the OpenMP directive for the nested loop? Are they correctly written? Thank you.

PROGRAM TES
IMPLICIT NONE
INTEGER, PARAMETER              ::             dp = SELECTED_REAL_KIND(15,14)
INTEGER                         ::             i, j, k, l, m
INTEGER                         ::             na = 177
INTEGER                         ::             nm = 100
INTEGER                         ::             km(3)
INTEGER                         ::             po(3)
REAL (KIND=dp), ALLOCATABLE     ::             kp(:,:)
COMPLEX (KIND=dp), ALLOCATABLE  ::             th(:,:)
COMPLEX (KIND=dp)               ::             ic = DCMPLX(0.0d0, 1.0d0)
COMPLEX (KIND=dp), ALLOCATABLE  ::             t1(:,:,:,:)
COMPLEX (KIND=dp), ALLOCATABLE  ::             t2(:)
COMPLEX (KIND=dp), ALLOCATABLE  ::             t3(:)
INTEGER                         ::             omp_get_num_threads
INTEGER                         ::             omp_get_thread_num
INTEGER                         ::             num_thread
INTEGER                         ::             id_thread

km(1) = 100
km(2) = 1
km(3) = 75
ALLOCATE (kp(km(1)*km(2)*km(3),3))
DO i = 1, km(1), 1
   DO j = 1, km(2), 1
      DO k = 1, km(3), 1
         kp((i-1)*km(2)*km(3)+(j-1)*km(3)+k,1) = DBLE(i)
         kp((i-1)*km(2)*km(3)+(j-1)*km(3)+k,2) = DBLE(j)
         kp((i-1)*km(2)*km(3)+(j-1)*km(3)+k,3) = DBLE(k)
      END DO
   END DO
END DO
po(1) = 1
po(2) = 0
po(3) = 5
ALLOCATE (th(na,na))
ALLOCATE (t1(nm,km(1)*km(2)*km(3),na,na))
ALLOCATE (t2(nm))
ALLOCATE (t3(nm))
!$OMP PARALLEL DEFAULT(NONE) SHARED(num_thread)
num_thread = omp_get_num_threads()
!$OMP END PARALLEL
!$OMP PARALLEL DO PRIVATE(i,j,k,l,m,th)&
!$OMP             SHARED(km,kp,nm,po,ic) COLLAPSE(2)&
!$OMP             REDUCTION(+:t1,t2,t3)
DO i = 1, nm, 1
   DO j = 1, km(1)*km(2)*km(3), 1
      th = DCMPLX(0.0d0, 0.0d0)
      DO k = 1, na, 1
         DO l = 1, na, 1
            DO m = 1, (po(1)*2+1)*(po(2)*2+1)*(po(3)*2+1), 1
               th(l,k) = th(l,k) + kp(j,1)*k + kp(j,2)*ic - kp(j,3)*l
            END DO
         END DO
      END DO
      t1(i,j,:,:) = th*(i-j+ic)
      DO k = 1, na, 1
         t2(i) = t2(i) + th(k,k)*(i*ic-j)
      END DO
      DO k = na, 1, -1
         t3(i) = t3(i) + th(k,na-k+1)*(i*ic+j)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO
OPEN (UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, nm, 1
   DO j = 1, km(1)*km(2)*km(3), 1
      DO k = 1, na, 1
         WRITE (UNIT=3, FMT=*) t1(i,j,k,:)
      END DO
   END DO
END DO
WRITE (UNIT=3, FMT=*)
DO i = 1, nm, 1
   WRITE (UNIT=3, FMT=*) t2(i)
END DO
WRITE (UNIT=3, FMT=*)
DO i = 1, nm, 1
   WRITE (UNIT=3, FMT=*) t3(i)
END DO
CLOSE (UNIT=3)

DEALLOCATE (kp)
DEALLOCATE (th)
DEALLOCATE (t1)
DEALLOCATE (t2)
DEALLOCATE (t3)

STOP
END PROGRAM TES

Solution

  • I could reproduce the problem and understand what was happening. In retrospect it could have been obvious just by reading the code of your reproducible example.

    The point is that your t1 is huge. With the dimensions in the example it has 100 x (100x1x75) x 177 x 177 ~ 23.5 Gelements, i.e. about 376 GB of double-precision complex numbers. Maybe you have the memory to afford that, but on top of it you are asking for a reduction on t1: when reducing a variable/array, OpenMP creates a private copy for each thread, exactly as if you had specified private(t1). So if you have 8 threads, OpenMP requires an additional 8x376 ~ 3000 GB of memory: you very likely do not have that amount... And even if you had, there would be a specific issue with ifort: it allocates private arrays on the stack, which is limited. gfortran allocates them on the heap if needed, so in that case you are only limited by the total amount of RAM. There is an ifort switch (-heap-arrays) to force the allocation on the heap, though.
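
    Just as a sanity check of that estimate, here is a tiny throwaway program; nothing in it comes from your code except the dimensions of the test program and the 16 bytes per COMPLEX(KIND=dp) element:

    ! Standalone footprint check (hypothetical helper, not part of the original code)
    PROGRAM FOOTPRINT
    IMPLICIT NONE
    INTEGER, PARAMETER :: dp = SELECTED_REAL_KIND(15,14)
    INTEGER (KIND=8)   :: nelem
    ! dimensions of t1 in the test program: (nm, km(1)*km(2)*km(3), na, na)
    nelem = 100_8 * (100*1*75) * 177 * 177
    PRINT *, 'elements of t1         :', nelem
    PRINT *, 'size in GB (16 B/elem) :', 16.0_dp*REAL(nelem,dp)/1.0d9
    END PROGRAM FOOTPRINT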

    The good news is that you don't need the reduction on t1 at all: you are updating it with t1(i,j,:,:) = ..., and because you are collapsing the loops on i and j, the different threads won't write simultaneously to the same locations. Problem solved...

    Still, you do need the reduction for t2 and t3, but these are small enough for this not to be an issue. However, you have to initialize them before the parallel region: you cannot assume their content is zero after the allocation, it is actually undefined (see the sketch just below).
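
    Putting these two points together, the parallel region of your small test program could look like this (only a sketch of the directives; the loop body stays exactly as you wrote it):

    t2 = DCMPLX(0.0d0, 0.0d0)   ! reduction variables must be initialized
    t3 = DCMPLX(0.0d0, 0.0d0)   ! before entering the parallel region
    !$OMP PARALLEL DO PRIVATE(i,j,k,l,m,th)        &
    !$OMP             SHARED(km,kp,nm,po,ic,t1)    &
    !$OMP             REDUCTION(+:t2,t3) COLLAPSE(2)
    DO i = 1, nm, 1
       DO j = 1, km(1)*km(2)*km(3), 1
          ! ... body unchanged: each (i,j) pair writes its own t1(i,j,:,:) ...
       END DO
    END DO
    !$OMP END PARALLEL DO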

    More observations on your code:

    • when privatising an allocatable array with the private clause (such as th), it's better to allocate it inside the parallel region (because it ensures that the private copies are on the heap rather than on the stack). This requires separating the OMP PARALLEL and OMP DO directives:
    !$OMP PARALLEL    PRIVATE(i,j,k,l,m,th)     &
    !$OMP             SHARED(km,kp,nm,po,ic,t1) &
    !$OMP             REDUCTION(+:t2,t3)
    allocate( th(na,na) )
    !$OMP DO          COLLAPSE(2)
    do i = 1, nm
       do j = 1, km(1)*km(2)*km(3)
          ...
       end do
    end do
    !$OMP END DO
    deallocate( th )
    !$OMP END PARALLEL
    
    • The memory access pattern on t1 is awful, even for the serial code. You should allocate it this way:
    ALLOCATE (t1(na,na,km(1)*km(2)*km(3),nm))
    

    then in the computations you would access it this way:

    t1(:,:,j,i) = ...
    

    which is much better: since Fortran arrays are stored column-major, t1(:,:,j,i) is a contiguous section of memory, which minimizes cache misses; and since j is the inner loop, the higher-rank indices stay as constant as possible within the inner loop, which is also cache friendly.
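
    Note that every other access to t1 then has to swap its indices in the same way; for instance, the output loops of the test program would become something like:

    DO i = 1, nm, 1
       DO j = 1, km(1)*km(2)*km(3), 1
          DO k = 1, na, 1
             WRITE (UNIT=3, FMT=*) t1(k,:,j,i)   ! same values as t1(i,j,k,:) in the old layout
          END DO
       END DO
    END DO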

    • Last but not least: do you really need such a huge array? By reorganizing the code you could maybe compute it by parts (though I don't know whether that is possible in your case), so that much less memory is needed; one rough sketch of the idea is given below.
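
    A very rough sketch of what such a by-parts computation could look like, assuming each block of t1 can be used (or written out) before moving on to the next one. nb_i and i0 are hypothetical names introduced for the sketch, not taken from your code:

    INTEGER, PARAMETER :: nb_i = 10                      ! hypothetical block size along i
    ALLOCATE (t1(na,na,km(1)*km(2)*km(3),nb_i))          ! only nb_i planes of t1 live at once
    DO i0 = 1, nm, nb_i
       t1 = DCMPLX(0.0d0, 0.0d0)
       !$OMP PARALLEL DO PRIVATE(i,j,k,l,m,th) SHARED(km,kp,nm,po,ic,t1,i0) &
       !$OMP             REDUCTION(+:t2,t3) COLLAPSE(2)
       DO i = i0, MIN(i0+nb_i-1,nm), 1
          DO j = 1, km(1)*km(2)*km(3), 1
             ! ... same body as before, but storing into t1(:,:,j,i-i0+1) ...
          END DO
       END DO
       !$OMP END PARALLEL DO
       ! ... use or write out this block of t1 here, then reuse the buffer ...
    END DO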