I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (`r_x`, `r_y`, `r_z`). I compute the Coulomb interaction between each pair of ions using a parallelised 2-nested do loop. At the end of the loop I obtain the acceleration (`a2_x`, `a2_y`, `a2_z`) of each ion, and then update the velocities and positions with velocity-Verlet.
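For context, here is a minimal sketch of how the Coulomb computation is assumed to fit into one velocity-Verlet step (the names `dt`, `v_x`, `v_y`, `v_z` and the `coulomb` interface are placeholders for illustration, not my actual code):

```fortran
subroutine verlet_step(N, dt, r_x, r_y, r_z, v_x, v_y, v_z, a2_x, a2_y, a2_z)
   implicit none
   integer, intent(in) :: N
   double precision, intent(in) :: dt
   double precision, dimension(N), intent(inout) :: r_x, r_y, r_z
   double precision, dimension(N), intent(inout) :: v_x, v_y, v_z
   double precision, dimension(N), intent(inout) :: a2_x, a2_y, a2_z

   ! half-kick and drift with the accelerations of the previous step
   v_x = v_x + 0.5d0*dt*a2_x ; v_y = v_y + 0.5d0*dt*a2_y ; v_z = v_z + 0.5d0*dt*a2_z
   r_x = r_x + dt*v_x        ; r_y = r_y + dt*v_y        ; r_z = r_z + dt*v_z

   ! recompute the accelerations from the new positions
   ! (this is the parallelised Coulomb loop discussed below)
   ! call coulomb(N, r_x, r_y, r_z, a2_x, a2_y, a2_z)

   ! second half-kick with the new accelerations
   v_x = v_x + 0.5d0*dt*a2_x ; v_y = v_y + 0.5d0*dt*a2_y ; v_z = v_z + 0.5d0*dt*a2_z
end subroutine verlet_step
```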
I use the following code in my program to compute the Coulomb forces applied to each ion: I compute the acceleration (`a2_x`) at the next time step, starting from the position (`r_x`) at the current time step. It is a 3D problem; I put all the lines, but most of them just repeat the same thing for x, y and z, so on a first read you can just follow the `_x` variables to see how this works.
I parallelize my loop over C threads; `ia` and `ib` are arrays used to split the N ions into C parts. For instance, for `C = 4` threads and `N = 16` ions (see edit remarks below):

```fortran
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
```
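For a general N that is not necessarily a multiple of C, such bounds could be built for instance like this (a sketch; the helper name `split_ions` is an illustration and not part of my program):

```fortran
subroutine split_ions(N, C, ia, ib)
   implicit none
   integer, intent(in)  :: N, C
   integer, intent(out) :: ia(C), ib(C)
   integer :: im, base, rest

   base = N / C        ! minimum number of ions per thread
   rest = mod(N, C)    ! leftover ions, given to the first 'rest' threads
   ia(1) = 1
   do im = 1, C
      ib(im) = ia(im) + base - 1
      if (im <= rest) ib(im) = ib(im) + 1
      if (im < C) ia(im+1) = ib(im) + 1
   end do
end subroutine split_ions
```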
The Coulomb forces are then computed as follows:
```fortran
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! index of the current thread (1-based), needs "use omp_lib"
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im), ib(im) ! loop over the ions assigned to this thread
   do j = 1, N ! loop over all ions
      rji(1) = r_x(j) - r_x(i) ! distance between ions i and j along x
      rji(2) = r_y(j) - r_y(i) ! along y
      rji(3) = r_z(j) - r_z(i) ! along z
      ! inverse of the (softened) distance between ion i and neighbour j
      r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
      r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/(4.pi.eps0)
      ! accumulation of the accelerations
      a2_x(i) = a2_x(i) - rji(1)*r2inv
      a2_y(i) = a2_y(i) - rji(2)*r2inv
      a2_z(i) = a2_z(i) - rji(3)*r2inv
   enddo
enddo
!$omp end parallel
```
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N (the double loop is O(N²)). Can you give me your opinion on this program? I have some specific questions.
I have been told I should have the positions `r_x`, `r_y` and `r_z` as `private` variables, which seems counter-intuitive to me because I want to enter the loop using the previously defined positions of the ions, so I use `firstprivate`. Is that right?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't `rji` and `r2inv` be `shared`? To compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions handled by two different threads.
Is the way I split the ions in the first `do` loop optimal?
For each ion i, I loop over all ions j, which induces a division by zero when the distance between ion i and itself is computed. To prevent this I have a `softening` variable set to a very small value so the distance is never exactly zero. I do this to avoid an `if (i == j)` test that would be time consuming (a sketch of that alternative is given just after this list of questions).
Also, isn't the square root time consuming as well?
For any additional detail feel free to ask.
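For reference, the `if` alternative mentioned above would replace the inner loop by something like this (a sketch only, reusing the variable names of the code above, meant to be timed against the softening version):

```fortran
do j = 1, N ! loop over all ions
   if (j == i) cycle ! skip the self-interaction instead of softening it
   rji(1) = r_x(j) - r_x(i)
   rji(2) = r_y(j) - r_y(i)
   rji(3) = r_z(j) - r_z(i)
   r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
   r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/(4.pi.eps0)
   a2_x(i) = a2_x(i) - rji(1)*r2inv
   a2_y(i) = a2_y(i) - rji(2)*r2inv
   a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
```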
My computer has a 10-core Xeon W-2155 CPU and 32 GB RAM. I intend to simulate around 1000 ions, and eventually up to 4000, which requires a lot of time.
This Coulomb subroutine is only one among other subroutines that may consume CPU time. For instance, one routine that may be time consuming is devoted to generating random numbers for each ion, depending on whether it is already excited or not, and to applying the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and `if` tests, one per ion.
Edit remarks:

Using `!$omp do` in combination with `schedule(dynamic,1)`, `schedule(guided)` or `schedule(nonmonotonic:dynamic)`, and/or `collapse(2)`, did not improve the run time; it made it at least three times longer. It has been suggested that the number of elements in my simulation (N) is too low to see a significant improvement. If I ever simulate much larger numbers of elements (4096, 8192, ...), I will try those options again.
Using `!$omp do` rather than a home-made ion distribution among cores turned out to be equivalent in terms of run time. As it is easier to implement, I will keep it.
Replacing the inverse `dsqrt` by `**(-0.5)` turned out to be equivalent in terms of run time. Delaying the square root and combining it with the third power of `r2inv` was also equivalent, so I replace the whole series of operations by `**(-1.5)`.
Same idea with `rji(1)*r2inv`: I compute `rji*r2inv` once and only use the result in the following lines (the resulting kernel is sketched just after these remarks).
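Putting these remarks together, the simplified kernel looks roughly like this (a sketch with the same declarations as the original code; the softening term is kept here):

```fortran
!$omp do
do i = 1, N ! loop over all ions
   do j = 1, N ! loop over all ions
      rji(1) = r_x(j) - r_x(i)
      rji(2) = r_y(j) - r_y(i)
      rji(3) = r_z(j) - r_z(i)
      r2inv = rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening
      r2inv = r2inv**(-1.5d0) * alpha(1) ! alpha is 1/(4.pi.eps0)
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
   enddo
enddo
!$omp end do
```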
I have tested the program with the following variants, for 1024 and 16384 ions:

- original, which is the program I present in my question;
- `!$omp do schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(guided)` and `do i = 2, N` / `do j = 1, i-1` for the loop (half the work).

It turns out my original version is still the fastest for me, but the `reduction` versions are not as "catastrophic" as the previous tests without `reduction`.
Version | N = 1024 | N = 16384 |
---|---|---|
original | 84 s | 15194 s |
`schedule(dynamic,1)` | 1300 s | not tested |
`reduction` and `schedule(dynamic,1)` | 123 s | 24860 s |
`reduction` and `schedule(guided)` (2) | 121 s | 24786 s |
What is weird is that @PierU still gets a faster computation with `reduction`, while for me it is not optimal. Where could such a difference come from?
There is also a lot of `where` and RNG elsewhere in the code; perhaps I should work on this as well.

Answer from @PierU:

A few remarks on your questions:

- The positions do not have to be `private`: since they are only read inside the loop, they could simply be `shared`. However, having `firstprivate` copies for each thread can give better performance in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.

- You can use the `!$OMP DO` directive instead of manually distributing the work to the different threads:

```fortran
!$OMP DO
do i = 1, N ! loop over all ions
   do j = 1, N ! loop over all ions
      ! ... same loop body as in the question ...
```

- Regarding the division by zero, you need a `softening` value that doesn't alter your simulation (this is something that you have to test against the `if` solution).

- You can delay the `sqrt` and the division like this:

```fortran
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/(4.pi.eps0)
```
The forces are symmetric in (i,j), so they can be computed only once per (i,j) pair. This also naturally avoids the i==j case and the softening value. A `reduction` clause is then needed on the `a2_*` arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a `dynamic` schedule is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...
```fortran
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
   do j = i+1, N ! loop over the ions j > i
      rji(1) = r_x(j) - r_x(i) ! distance between ions i and j along x
      rji(2) = r_y(j) - r_y(i) ! along y
      rji(3) = r_z(j) - r_z(i) ! along z
      ! squared distance between ion i and neighbour j
      r2inv = rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3)
      r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/(4.pi.eps0)
      ! computation of the accelerations, applied to both i and j
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
      a2_x(j) = a2_x(j) + rji(1)
      a2_y(j) = a2_y(j) + rji(2)
      a2_z(j) = a2_z(j) + rji(3)
   enddo
enddo
!$omp end do
!$omp end parallel
```
Alternatively, a `guided` schedule could be used, with the iterations reordered so that the low-workload ones come first:

```fortran
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
   do j = 1, i-1 ! loop over the ions j < i
      ! ... same loop body as above ...
```
I have timed the latter code (the "divided by 2" version) on an old Core i5 from 2011 (4 cores), compiled with gfortran 12, in the following configurations: no OpenMP / OpenMP with 1 thread / 4 threads with no explicit schedule (that is, static by default) / `schedule(dynamic)` / `schedule(nonmonotonic:dynamic)` / `schedule(guided)`. The `guided` schedule is timed with two code versions: (1) with `do i=1,N-1; do j=i+1,N`, (2) with `do i=2,N; do j=1,i-1`.
Version | N=256 | N=1024 | N=4096 | N=16384 | N=65536 |
---|---|---|---|---|---|
no omp | 0.0016 | 0.026 | 0.41 | 6.9 | 116 |
1 thread | 0.0019 | 0.027 | 0.48 | 8.0 | 118 |
4 threads | 0.0014 | 0.013 | 0.20 | 3.4 | 55 |
dynamic | 0.0018 | 0.020 | 0.29 | 5.3 | 84 |
nonmonotonic | | | 0.29 | 5.2 | 85 |
guided (1) | 0.0014 | 0.013 | 0.21 | 3.7 | 61 |
guided (2) | 0.0009 | 0.093 | 0.13 | 2.2 | 38 |
The `guided` schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It's important to note, however, that the behavior can differ on a different CPU and with a different compiler.
I have also timed the code with the full `do i=1,N; do j=1,N` loops (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):
Version | N=256 | N=1024 | N=4096 | N=16384 | N=65536 |
---|---|---|---|---|---|
no omp | 0.0028 | 0.047 | 0.72 | 11.5 | 183 |
4 threads | 0.0013 | 0.019 | 0.25 | 4.0 | 71 |