I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (`r_x`, `r_y`, `r_z`). I compute the Coulomb interaction between each pair of ions using a parallelised 2-nested do loop. At the end of the loop I obtain the acceleration (`a2_x`, `a2_y`, `a2_z`) of each ion, and then update the velocities and positions with velocity-Verlet.
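For context, here is a minimal sketch of how the Coulomb computation is assumed to fit into one velocity-Verlet step (the names `dt`, `v_x`, `v_y`, `v_z` and the `coulomb` interface are placeholders for illustration, not my actual code):

```fortran
subroutine verlet_step(N, dt, r_x, r_y, r_z, v_x, v_y, v_z, a2_x, a2_y, a2_z)
   implicit none
   integer, intent(in) :: N
   double precision, intent(in) :: dt
   double precision, dimension(N), intent(inout) :: r_x, r_y, r_z
   double precision, dimension(N), intent(inout) :: v_x, v_y, v_z
   double precision, dimension(N), intent(inout) :: a2_x, a2_y, a2_z

   ! half-kick and drift with the accelerations of the previous step
   v_x = v_x + 0.5d0*dt*a2_x ; v_y = v_y + 0.5d0*dt*a2_y ; v_z = v_z + 0.5d0*dt*a2_z
   r_x = r_x + dt*v_x        ; r_y = r_y + dt*v_y        ; r_z = r_z + dt*v_z

   ! recompute the accelerations from the new positions
   ! (this is the parallelised Coulomb loop discussed below)
   ! call coulomb(N, r_x, r_y, r_z, a2_x, a2_y, a2_z)

   ! second half-kick with the new accelerations
   v_x = v_x + 0.5d0*dt*a2_x ; v_y = v_y + 0.5d0*dt*a2_y ; v_z = v_z + 0.5d0*dt*a2_z
end subroutine verlet_step
```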
I use the following code in my program to compute the Coulomb forces applied to each ion: I compute the acceleration (`a2_x`) at the next time step, starting from the position (`r_x`) at the current time step. It is a 3D problem; I put all the lines, but most of them just repeat the same thing for x, y and z, so on a first read you can just follow the `_x` variables to see how this works.
I parallelize my loop over C threads; `ia` and `ib` are arrays used to split the N ions into C parts. For instance, for `C = 4` threads and `N = 16` ions (see edit remarks below):

```fortran
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
```
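For a general N that is not necessarily a multiple of C, such bounds could be built for instance like this (a sketch; the helper name `split_ions` is an illustration and not part of my program):

```fortran
subroutine split_ions(N, C, ia, ib)
   implicit none
   integer, intent(in)  :: N, C
   integer, intent(out) :: ia(C), ib(C)
   integer :: im, base, rest

   base = N / C        ! minimum number of ions per thread
   rest = mod(N, C)    ! leftover ions, given to the first 'rest' threads
   ia(1) = 1
   do im = 1, C
      ib(im) = ia(im) + base - 1
      if (im <= rest) ib(im) = ib(im) + 1
      if (im < C) ia(im+1) = ib(im) + 1
   end do
end subroutine split_ions
```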
The Coulomb forces are then computed as follows:
```fortran
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! index of the current thread (1-based), needs "use omp_lib"
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im), ib(im) ! loop over the ions assigned to this thread
   do j = 1, N ! loop over all ions
      rji(1) = r_x(j) - r_x(i) ! distance between ions i and j along x
      rji(2) = r_y(j) - r_y(i) ! along y
      rji(3) = r_z(j) - r_z(i) ! along z
      ! inverse of the (softened) distance between ion i and neighbour j
      r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
      r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/(4.pi.eps0)
      ! accumulation of the accelerations
      a2_x(i) = a2_x(i) - rji(1)*r2inv
      a2_y(i) = a2_y(i) - rji(2)*r2inv
      a2_z(i) = a2_z(i) - rji(3)*r2inv
   enddo
enddo
!$omp end parallel
```
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N (the double loop is O(N²)). Can you give me your opinion on this program? I have some specific questions.
I have been told I should have the positions `r_x`, `r_y` and `r_z` as `private` variables, which seems counter-intuitive to me because I want to enter the loop using the previously defined positions of the ions, so I use `firstprivate`. Is that right?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't `rji` and `r2inv` be `shared`? To compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions handled by two different threads.
Is the way I split the ions in the first `do` loop optimal?
For each ion i, I loop over all ions j, which induces a division by zero when the distance between ion i and itself is computed. To prevent this I have a `softening` variable set to a very small value so the distance is never exactly zero. I do this to avoid an `if (i == j)` test that would be time consuming (a sketch of that alternative is given just after this list of questions).
Also, isn't the square root time consuming as well?
For any additional detail feel free to ask.
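For reference, the `if` alternative mentioned above would replace the inner loop by something like this (a sketch only, reusing the variable names of the code above, meant to be timed against the softening version):

```fortran
do j = 1, N ! loop over all ions
   if (j == i) cycle ! skip the self-interaction instead of softening it
   rji(1) = r_x(j) - r_x(i)
   rji(2) = r_y(j) - r_y(i)
   rji(3) = r_z(j) - r_z(i)
   r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
   r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/(4.pi.eps0)
   a2_x(i) = a2_x(i) - rji(1)*r2inv
   a2_y(i) = a2_y(i) - rji(2)*r2inv
   a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
```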
My computer has a 10-core Xeon W-2155 CPU and 32 GB RAM. I intend to simulate around 1000 ions, and eventually up to 4000, which requires a lot of time.
This Coulomb subroutine is only one among other subroutines that may consume CPU time. For instance, one routine that may be time consuming is devoted to generating random numbers for each ion, depending on whether it is already excited or not, and to applying the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and `if` tests, one per ion.
Edit remarks:

Using `!$omp do` in combination with `schedule(dynamic,1)`, `schedule(guided)` or `schedule(nonmonotonic:dynamic)`, and/or `collapse(2)`, did not improve the run time; it made it at least three times longer. It has been suggested that the number of elements in my simulation (N) is too low to see a significant improvement. If I ever simulate much larger numbers of elements (4096, 8192, ...), I will try those options again.
Using `!$omp do` rather than a home-made ion distribution among cores turned out to be equivalent in terms of run time. As it is easier to implement, I will keep it.
Replacing the inverse `dsqrt` by `**(-0.5)` turned out to be equivalent in terms of run time. Delaying the square root and combining it with the third power of `r2inv` was also equivalent, so I replace the whole series of operations by `**(-1.5)`.
Same idea with `rji(1)*r2inv`: I compute `rji*r2inv` once and only use the result in the following lines (the resulting kernel is sketched just after these remarks).
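Putting these remarks together, the simplified kernel looks roughly like this (a sketch with the same declarations as the original code; the softening term is kept here):

```fortran
!$omp do
do i = 1, N ! loop over all ions
   do j = 1, N ! loop over all ions
      rji(1) = r_x(j) - r_x(i)
      rji(2) = r_y(j) - r_y(i)
      rji(3) = r_z(j) - r_z(i)
      r2inv = rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening
      r2inv = r2inv**(-1.5d0) * alpha(1) ! alpha is 1/(4.pi.eps0)
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
   enddo
enddo
!$omp end do
```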
I have tested the program with the following variants, for 1024 and 16384 ions:

- original, which is the program I present in my question;
- `!$omp do schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(guided)` and `do i = 2, N` / `do j = 1, i-1` for the loop (half the work).

It turns out my original version is still the fastest for me, but the `reduction` versions are not as "catastrophic" as the previous tests without `reduction`.
Version | N = 1024 | N = 16384 |
---|---|---|
original | 84 s | 15194 s |
`schedule(dynamic,1)` | 1300 s | not tested |
`reduction` and `schedule(dynamic,1)` | 123 s | 24860 s |
`reduction` and `schedule(guided)` (2) | 121 s | 24786 s |
What is weird is that @PierU still gets a faster computation with `reduction`, while for me it is not optimal. Where could such a difference come from?
There is also a lot of `where` and RNG elsewhere in the code; perhaps I should work on this as well.

Answer from @PierU:

A few remarks on your questions:

- The positions do not have to be `private`: since they are only read inside the loop, they could simply be `shared`. However, having `firstprivate` copies for each thread can give better performance in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.

- You can use the `!$OMP DO` directive instead of manually distributing the work to the different threads:

```fortran
!$OMP DO
do i = 1, N ! loop over all ions
   do j = 1, N ! loop over all ions
      ! ... same loop body as in the question ...
```

- Regarding the division by zero, you need a `softening` value that doesn't alter your simulation (this is something that you have to test against the `if` solution).

- You can delay the `sqrt` and the division like this:

```fortran
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/(4.pi.eps0)
```
The forces are symmetric in (i,j), so they can be computed only once per (i,j) pair. This also naturally avoids the i==j case and the softening value. A `reduction` clause is then needed on the `a2_*` arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a `dynamic` schedule is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...
```fortran
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
   do j = i+1, N ! loop over the ions j > i
      rji(1) = r_x(j) - r_x(i) ! distance between ions i and j along x
      rji(2) = r_y(j) - r_y(i) ! along y
      rji(3) = r_z(j) - r_z(i) ! along z
      ! squared distance between ion i and neighbour j
      r2inv = rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3)
      r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/(4.pi.eps0)
      ! computation of the accelerations, applied to both i and j
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
      a2_x(j) = a2_x(j) + rji(1)
      a2_y(j) = a2_y(j) + rji(2)
      a2_z(j) = a2_z(j) + rji(3)
   enddo
enddo
!$omp end do
!$omp end parallel
```
Alternatively, a `guided` schedule could be used, with the iterations reordered so that the low-workload ones come first:

```fortran
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
   do j = 1, i-1 ! loop over the ions j < i
      ! ... same loop body as above ...
```
I have timed the latter code (the "divided by 2" version) on an old Core i5 from 2011 (4 cores), compiled with gfortran 12, in the following configurations: no OpenMP / OpenMP with 1 thread / 4 threads with no explicit schedule (that is, static by default) / `schedule(dynamic)` / `schedule(nonmonotonic:dynamic)` / `schedule(guided)`. The `guided` schedule is timed with two code versions: (1) with `do i=1,N-1; do j=i+1,N`, (2) with `do i=2,N; do j=1,i-1`.
Version | N=256 | N=1024 | N=4096 | N=16384 | N=65536 |
---|---|---|---|---|---|
no omp | 0.0016 | 0.026 | 0.41 | 6.9 | 116 |
1 thread | 0.0019 | 0.027 | 0.48 | 8.0 | 118 |
4 threads | 0.0014 | 0.013 | 0.20 | 3.4 | 55 |
dynamic | 0.0018 | 0.020 | 0.29 | 5.3 | 84 |
nonmonotonic | | | 0.29 | 5.2 | 85 |
guided (1) | 0.0014 | 0.013 | 0.21 | 3.7 | 61 |
guided (2) | 0.0009 | 0.093 | 0.13 | 2.2 | 38 |
The `guided` schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It's important to note, however, that the behavior can differ on a different CPU and with a different compiler.
I have also timed the code with the full `do i=1,N; do j=1,N` loops (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):
Version | N=256 | N=1024 | N=4096 | N=16384 | N=65536 |
---|---|---|---|---|---|
no omp | 0.0028 | 0.047 | 0.72 | 11.5 | 183 |
4 threads | 0.0013 | 0.019 | 0.25 | 4.0 | 71 |