Tags: performance, for-loop, parallel-processing, fortran, openmp

Fortran triply nested do-loop OpenMP parallelization: false-sharing removal


I have the following Fortran subroutine, which gprof shows consumes > 50% of the run time and is by far the most expensive part of the code. I initially received it without any OpenMP directives. My attempt to make this subroutine faster is to use OpenMP as below. I can request at most 24 cores from one node of an HPC machine. I compile with ifort.

I want to ask you:

  1. Is the writing at the end false sharing? I think so: in Fortran the first index varies the fastest, and I parallelize the outermost i do-loop, so according to MY UNDERSTANDING of Chapter 6 of Parallel Programming in OpenMP by Rohit Chandra et al. (page 191), I am in a bad position.

  2. Is there an intelligent way to somehow rephrase this triply nested do-loop in order to benefit from OpenMP as much as possible?

  3. Is the simple do-loop over j (the wrk(j) = (-1)**(j-1) * ri(j) * dsqrt( rx(xi(j)) * wi(j) ) instruction) parallelized well? Is there anything to think about there?

The variables are as follows:

Nb = an integer roughly 1000 - 1500 - 2000
lmax = an integer roughly 70 - 100 - 150
nphi = 2 * lmax + 2
Nbk = an integer roughly 1000 - 1500 - 2000 (- 2500)
Np = Nb + 1
            ! Subroutine rtop(): transforms psi0(r, l, m) to psi1(p, l, m)
            !------------------------------------------------------------------
            subroutine rtop(psi0, psi1, xi, wi, ri, BJ)
            use gridvars
            use cvars
            implicit none
    
            complex*16 psi0(Nb, 0:lmax, 0:nphi), psi1(Nbk, 0:lmax, 0:nphi)
            real*8 xi(0:Np), wi(0:Np), ri(Nb), BJ(Nbk, Nb, 0:lmax)
    
            ! local
            integer i, j, l, m
            real*8 wrk(Nb)
            complex*16 wrk2(Nb)
            real*8 :: rx ! function
    
            !$OMP PARALLEL DO PRIVATE(j) SCHEDULE(STATIC)
            do j = 1, Nb
              wrk(j) = (-1)**(j-1) * ri(j) * dsqrt( rx(xi(j)) * wi(j) )
            end do
            !$OMP END PARALLEL DO
    
    
            !$OMP PARALLEL DO PRIVATE(i, l, m, wrk2) SCHEDULE(STATIC)
            do i = 1, Nbk
    
              do l = 0, lmax
                wrk2(1:Nb) = (-ai)**l * BJ(i, 1:Nb, l) * wrk(1:Nb)
                do m = 0, nphi-1
                  ! multiply spherical Bessel functions
                  ! integrate over r
                  psi1(i, l, m) = sum( psi0(1:Nb, l, m) * wrk2(1:Nb) )
                end do ! m
              end do ! l
    
            end do ! i
            !$OMP END PARALLEL DO
    
            psi1(:, :, :) = dsqrt(2d0/pi) * psi1(:, :, :)
    
            return
            end subroutine rtop

Thanks a lot!


Solution

  • Is the writing at the end false sharing? I think so; in Fortran the first index varies the fastest, and I parallelize the outermost i do-loop.

    Unfortunately, yes. However, false sharing is not the biggest issue here: the non-contiguous writes alone can already decrease performance.

    A solution is to create a transposed version of psi1 with the shape (0:nphi, 0:lmax, Nbk) and change the loop so that you write to psi1(m, l, i). Then, if needed, you can transpose the array back after the parallel loop. The transposition can be done in parallel, though this operation generally does not scale. Note that writing a fast transposition is not trivial, especially for non-cubic 3D arrays: one needs to take the CPU cache hierarchy into account. A fast transposition should use a tile-based approach to be cache-friendly. The Intel manuals provide a relatively fast 2D transposition implementation.

    In your case, an out-of-place transposition is certainly mandatory to get a fast execution.

    Note that psi1(:, :, :) = dsqrt(2d0/pi) * psi1(:, :, :) should be memory-bound since dsqrt(2d0/pi) is a constant. In the current code, this scaling can be folded into the parallel loop so as to make the operation less memory-bound. With the transposition solution, you can transpose the array and multiply the items by the constant at the same time, as sketched below.
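
    A minimal sketch of this idea (psi1_t is an illustrative temporary array I introduce here; everything else comes from the question, and the declaration would of course go with the other locals):

        ! Write into a transposed temporary so that the (m, l) writes of one
        ! thread are contiguous and stay within that thread's own i range.
        complex*16, allocatable :: psi1_t(:, :, :)
        allocate(psi1_t(0:nphi, 0:lmax, Nbk))

        !$OMP PARALLEL DO PRIVATE(i, l, m, wrk2) SCHEDULE(STATIC)
        do i = 1, Nbk
          do l = 0, lmax
            wrk2(1:Nb) = (-ai)**l * BJ(i, 1:Nb, l) * wrk(1:Nb)
            do m = 0, nphi-1
              psi1_t(m, l, i) = sum( psi0(1:Nb, l, m) * wrk2(1:Nb) )
            end do ! m
          end do ! l
        end do ! i
        !$OMP END PARALLEL DO

        ! Out-of-place transposition fused with the dsqrt(2d0/pi) scaling.
        ! A production version should additionally block/tile over i and m
        ! to be cache-friendly, as discussed above.
        !$OMP PARALLEL DO PRIVATE(i, l, m) SCHEDULE(STATIC)
        do m = 0, nphi-1
          do l = 0, lmax
            do i = 1, Nbk
              psi1(i, l, m) = dsqrt(2d0/pi) * psi1_t(m, l, i)
            end do
          end do
        end do
        !$OMP END PARALLEL DO

        deallocate(psi1_t)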

  • Is there an intelligent way to somehow rephrase this triply nested do-loop in order to benefit from OpenMP as much as possible?

    Besides the transposition and the use of one big parallel region (as proposed in the comments, and sketched below), I do not think there is much more to do. In the end, this operation should be vectorized with SIMD instructions and it should be memory-bound: the loop should be able to nearly saturate the RAM bandwidth if everything goes well.
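
    For reference, the single-parallel-region variant could look like this (shown on the original psi1 layout for brevity; the transposed write from the previous sketch can be combined with it):

        ! One parallel region spanning both loops: threads are spawned once
        ! per call, and the implicit barrier after the first worksharing loop
        ! guarantees wrk is fully written before it is read.
        !$OMP PARALLEL PRIVATE(i, j, l, m, wrk2)
        !$OMP DO SCHEDULE(STATIC)
        do j = 1, Nb
          wrk(j) = (-1)**(j-1) * ri(j) * dsqrt( rx(xi(j)) * wi(j) )
        end do
        !$OMP END DO

        !$OMP DO SCHEDULE(STATIC)
        do i = 1, Nbk
          do l = 0, lmax
            wrk2(1:Nb) = (-ai)**l * BJ(i, 1:Nb, l) * wrk(1:Nb)
            do m = 0, nphi-1
              psi1(i, l, m) = sum( psi0(1:Nb, l, m) * wrk2(1:Nb) )
            end do ! m
          end do ! l
        end do ! i
        !$OMP END DO
        !$OMP END PARALLEL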

    If you run this code on a NUMA machine, then the placement of the psi0 and psi1 arrays may prevent you from reaching the optimal memory throughput (wrk2 matters much less). On most systems, the OS maps a virtual memory page to the memory of the NUMA node that first touches it. Thus, if you use a platform with multiple NUMA nodes, you need to initialize the two arrays in parallel, following the same parallelization scheme as the computation, so that the pages are distributed efficiently for the subsequent computation (see the sketch below). An alternative is to mitigate NUMA effects by mapping pages randomly on the available NUMA nodes using numactl, though the latter may cause unwanted performance issues.
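
    A first-touch sketch for psi1, ideally done once right after allocation (psi0 would need the same treatment wherever it is first written, typically in the caller); the point is that the parallelization and schedule match the compute loop:

        ! Touch psi1 with the same i-parallelization and STATIC schedule as
        ! the compute loop, so each page lands on the NUMA node of the
        ! thread that will later write it.
        !$OMP PARALLEL DO PRIVATE(i, l, m) SCHEDULE(STATIC)
        do i = 1, Nbk
          do l = 0, lmax
            do m = 0, nphi
              psi1(i, l, m) = (0d0, 0d0)
            end do
          end do
        end do
        !$OMP END PARALLEL DO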

    By the way, I am not entirely sure operations like psi0(1:Nb, l, m) * wrk2(1:Nb) are guaranteed not to produce a temporary array. You can use explicit loops to be sure no allocation is made and to avoid temporary data being stored and read back, at the expense of more verbose code (see the sketch below). I strongly advise you to check the assembly code of this loop nest.
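
    For instance, the innermost sum can be written with explicit loops and a scalar accumulator (acc is a new complex*16 local that would have to be added to the PRIVATE clause):

        ! Explicit reduction instead of the array expression: the compiler
        ! cannot create a temporary array here.
        do m = 0, nphi-1
          acc = (0d0, 0d0)
          do j = 1, Nb
            acc = acc + psi0(j, l, m) * wrk2(j)
          end do
          psi1(i, l, m) = acc
        end do ! m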

  • Is the simple do-loop over j (the wrk(j) = (-1)**(j-1) * ri(j) * dsqrt( rx(xi(j)) * wi(j) ) instruction) parallelized well? Is there anything to think about there?

    The main issue here is rx(xi(j)), which prevents the loop from being vectorized on most platforms. Moreover, depending on the content of xi(j), the access pattern can be fairly random and many cache misses can happen; in the worst case, the cache lines might never be reused. In practice, with Nb ~ 1000..2000 the arrays should be small enough for this not to be a big issue. One can use software prefetching to help the processor, but modern processors should already do a relatively good job here (thanks to out-of-order execution combined with instruction-level parallelism).

    The loop might actually be too small for multiple threads to be useful, as it takes time to distribute the work across many cores and to wait for the results. I advise you to try removing the parallel directive and to check the resulting performance (see the sketch below for one way to do this without deleting the directive).
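
    One way to experiment with this is OpenMP's if clause, which falls back to serial execution at run time when the condition is false (the threshold below is purely a placeholder to tune):

        ! Run the j loop serially unless Nb is large enough for the
        ! parallel overhead to pay off.
        !$OMP PARALLEL DO PRIVATE(j) SCHEDULE(STATIC) IF(Nb >= 100000)
        do j = 1, Nb
          wrk(j) = (-1)**(j-1) * ri(j) * dsqrt( rx(xi(j)) * wi(j) )
        end do
        !$OMP END PARALLEL DO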

    Note that you can unroll the loop manually so as to avoid the (-1)**(j-1), using something like the sketch below. That being said, optimizing compilers already generate good code for this, and the dsqrt call is typically much more expensive anyway.
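
    One possible form of that unrolling (a sketch; the leftover iteration when Nb is odd is handled after the loop):

        ! Process two iterations per step so the sign is known at compile
        ! time and (-1)**(j-1) disappears: odd j gets +, even j gets -.
        do j = 1, Nb - 1, 2
          wrk(j)     =   ri(j)     * dsqrt( rx(xi(j))     * wi(j) )
          wrk(j + 1) = - ri(j + 1) * dsqrt( rx(xi(j + 1)) * wi(j + 1) )
        end do
        if (mod(Nb, 2) == 1) then
          wrk(Nb) = ri(Nb) * dsqrt( rx(xi(Nb)) * wi(Nb) )
        end if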