I have previously been working with openMP in relatively simple situations. As I'm not that experience yet, I need some advice on how to best use openMP in my below code.
Basically my code numerically integrates two vectors (up and un) forward in time by solving two coupled differential equations using a Runge-Kutta 4th order scheme - i.e. for every step forward in time four temporary steps are made:
#define DUDT_P(STEP) cmplx(0,1)*( d(rng_0,1)*CONJG(un(rng_p1,STEP))*un(rng_p2,STEP) + \ d(rng_0,2)*CONJG(un(rng_m1,STEP))*up(rng_p1,STEP) + \ d(rng_0,3)* un(rng_m2,STEP) *up(rng_m1,STEP) ) #define DUDT_N(STEP) cmplx(0,1)*( d(rng_0,1)*CONJG(up(rng_p1,STEP))*up(rng_p2,STEP) + \ d(rng_0,2)*CONJG(up(rng_m1,STEP))*un(rng_p1,STEP) + \ d(rng_0,3)* up(rng_m2,STEP) *un(rng_m1,STEP) ) [...] do ii = 2, nt+1 do jj = 1,nti ! ---------- STEP 1/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE dup(rng_0,1) = DUDT_P(1) up(:,2) = up(:,1) + dt2*dup(:,1) dun(rng_0,1) = DUDT_N(1) un(:,2) = un(:,1) + dt2*dun(:,1) ! ---------- STEP 2/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE dup(rng_0,2) = DUDT_P(2) up(:,3) = up(:,1) + dt2*dup(:,2) dun(rng_0,2) = DUDT_N(2) un(:,3) = un(:,1) + dt2*dun(:,2) ! ---------- STEP 3/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE dup(rng_0,3) = DUDT_P(3) up(:,4) = up(:,1) + dt*dup(:,3) dun(rng_0,3) = DUDT_N(3) un(:,4) = un(:,1) + dt*dun(:,3) ! ---------- STEP 4/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE dup(rng_0,4) = DUDT_P(4) up(:,1) = up(:,1) + dt6*dup(:,1) + dt3*dup(:,2) + dt3*dup(:,3) + dt6*dup(:,4) ! Full/actual time stepped solution (full RK4 solution) dun(rng_0,4) = DUDT_N(4) un(:,1) = un(:,1) + dt6*dun(:,1) + dt3*dun(:,2) + dt3*dun(:,3) + dt6*dun(:,4) ! Full/actual time stepped solution (full RK4 solution) end do saved(:,ii,1) = up(:,1) ! Save solution saved(:,ii,2) = un(:,1) end do
Now, since each vector is 20 components large, there are a lot of arithmetic operations needed per temporary time step. You can see this by looking at the derivative DUDT_P() and DUDT_N() (macro) functions, one for each vector, and notice that they involve element-wise arithmetic in the rng_* entries between the vectors (here all rng_* lists specify various of the 20 elements in the up and un vectors).
There is one important constraint, which I have also commented in the code. Since the solution at each following temporary time step depends on the previous, the threads MUST have the same updated view on the up and un vectors before calculating the next temp. time step solution (i.e. shared).
For clarity on the variables, this is my used notation:
I have thought of a couple ways to tackle this, however, I'm stuck in both:
Thanks for any comment you may have!
You have a very good ADD/MUL balance. On this kind of CPU-bound code, ifort or PGI may give a better performance than gfortran. Are you using such a compiler?
Have you activated the best compiler options for your CPU (-xAVX
, -xCORE-AVX2
, -xSSE4.2
, etc) ? This could make a lot of difference on your particular example.
If you put explicitly the loop boundaries : do k=1,20
,
the compiler will know what to best for the loops.
You could also play on memory alignment. With the Intel compiler, you
can put the directive:
!DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un
after the declaration of your variables. It is also possible with the PGI compiler. This will improve the memory access, and the vectorization possibilities of the compiler, as you can only vectorize with data aligned
on a 32 byte boundary (or 64 bytes on Xeon Phi, and 16 bytes on older CPUs).
As your first dimension is 20, all the columns of your array are also 32-byte
aligned. You can tell this to the compiler using the directive !DIR$ VECTOR ALIGNED
before the inner-most loop.
Try -O2
instead of -O3
, as -O3
is sometimes slower.
Try the flush-to-zero compiler option : working with denormalized numbers slows down the execution
If you want to use one thread for up
and one thread for un
, you could try the following,
but I'm really not sure you will gain (or event not lose...) because of the openMP barriers.
Thread 0 is doing up
and thread 1 is doing un
. It is important to put the allocate
statements and initialization inside the openMP sections so that the memory is allocated as close as possible
to the threads (first-touch policy).
Here your code modified with all those ideas:
!DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un
#define DUDT_P(STEP) cmplx(0,1)*( d(rng_0,1)*CONJG(un(rng_p1,STEP))*un(rng_p2,STEP) + \
d(rng_0,2)*CONJG(un(rng_m1,STEP))*up(rng_p1,STEP) + \
d(rng_0,3)* un(rng_m2,STEP) *up(rng_m1,STEP) )
#define DUDT_N(STEP) cmplx(0,1)*( d(rng_0,1)*CONJG(up(rng_p1,STEP))*up(rng_p2,STEP) + \
d(rng_0,2)*CONJG(up(rng_m1,STEP))*un(rng_p1,STEP) + \
d(rng_0,3)* up(rng_m2,STEP) *un(rng_m1,STEP) )
[...]
!$OMP PARALLEL SHARED( up,un,dup,dun, [...] ) PRIVATE( [...] ) num_threads(2)
ithread = omp_get_thread_num()
!DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un
if (ithread == 0) then
allocate (up(1:20,4), dup(1:20,4))
! Initialization is important to pin the memory to the cores
up = 0.
dun = 0.
else
allocate (un(1:20,4), dun(1:20,4))
! Initialization is important to pin the memory to the cores
un = 0.
dun = 0.
endif
do ii = 2, nt+1
do jj = 1,nti
!$OMP BARRIER
! ---------- STEP 1/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
if (ithread == 0) then
dup(rng_0,1) = DUDT_P(1)
!DIR$ VECTOR ALIGNED
do k=1,20
up(k,2) = up(k,1) + dt2*dup(k,1)
end do
else
dun(rng_0,1) = DUDT_N(1)
!DIR$ VECTOR ALIGNED
do k=1,20
un(k,2) = un(k,1) + dt2*dun(k,1)
end do
endif
!$OMP BARRIER
! ---------- STEP 2/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
if (ithread == 0) then
dup(rng_0,2) = DUDT_P(2)
!DIR$ VECTOR ALIGNED
do k=1,20
up(k,3) = up(k,1) + dt2*dup(k,2)
end do
else
dun(rng_0,2) = DUDT_N(2)
!DIR$ VECTOR ALIGNED
do k=1,20
un(k,3) = un(k,1) + dt2*dun(k,2)
end do
endif
!$OMP BARRIER
! ---------- STEP 3/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
if (ithread == 0) then
dup(rng_0,3) = DUDT_P(3)
!DIR$ VECTOR ALIGNED
do k=1,20
up(k,4) = up(k,1) + dt*dup(k,3)
end do
else
dun(rng_0,3) = DUDT_N(3)
!DIR$ VECTOR ALIGNED
do k=1,20
un(k,4) = un(k,1) + dt*dun(k,3)
end do
endif
!$OMP BARRIER
! ---------- STEP 4/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
if (ithread == 0) then
dup(rng_0,4) = DUDT_P(4)
!DIR$ VECTOR ALIGNED
do k=1,20
up(k,1) = up(k,1) + dt6*dup(k,1) + dt3*dup(k,2) + dt3*dup(k,3) + dt6*dup(k,4) ! Full/actual time stepped solution (full RK4 solution)
enddo
else
dun(rng_0,4) = DUDT_N(4)
!DIR$ VECTOR ALIGNED
do k=1,20
un(k,1) = un(k,1) + dt6*dun(k,1) + dt3*dun(k,2) + dt3*dun(k,3) + dt6*dun(k,4) ! Full/actual time stepped solution (full RK4 solution)
enddo
endif
end do
if (ithread == 0) then
do k=1,20
saved(k,ii,1) = up(k,1) ! Save solution
end do
else
do k=1,20
saved(k,ii,2) = un(k,1) ! Save solution
end do
end if
end do
if (ithread == 0) then
deallocate(un,up,dun,dup)
end if
!$OMP END PARALLEL