Advice on openMP implementation

I have previously been working with openMP in relatively simple situations. As I'm not that experience yet, I need some advice on how to best use openMP in my below code.

Basically my code numerically integrates two vectors (up and un) forward in time by solving two coupled differential equations using a Runge-Kutta 4th order scheme - i.e. for every step forward in time four temporary steps are made:

#define DUDT_P(STEP)     cmplx(0,1)*(   d(rng_0,1)*CONJG(un(rng_p1,STEP))*un(rng_p2,STEP) + \
                                        d(rng_0,2)*CONJG(un(rng_m1,STEP))*up(rng_p1,STEP) + \
                                        d(rng_0,3)*      un(rng_m2,STEP) *up(rng_m1,STEP) )

#define DUDT_N(STEP)     cmplx(0,1)*(   d(rng_0,1)*CONJG(up(rng_p1,STEP))*up(rng_p2,STEP) + \
                                        d(rng_0,2)*CONJG(up(rng_m1,STEP))*un(rng_p1,STEP) + \
                                        d(rng_0,3)*      up(rng_m2,STEP) *un(rng_m1,STEP) )

[...]

do ii = 2, nt+1 
        do jj = 1,nti

                ! ---------- STEP 1/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
                dup(rng_0,1) = DUDT_P(1)
                up(:,2) = up(:,1) + dt2*dup(:,1)  

                dun(rng_0,1) = DUDT_N(1) 
                un(:,2) = un(:,1) + dt2*dun(:,1)

                ! ---------- STEP 2/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
                dup(rng_0,2) = DUDT_P(2)
                up(:,3) = up(:,1) + dt2*dup(:,2)

                dun(rng_0,2) = DUDT_N(2)
                un(:,3) = un(:,1) + dt2*dun(:,2)

                ! ---------- STEP 3/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
                dup(rng_0,3) = DUDT_P(3)
                up(:,4) = up(:,1) + dt*dup(:,3)

                dun(rng_0,3) = DUDT_N(3)
                un(:,4) = un(:,1) + dt*dun(:,3)

                ! ---------- STEP 4/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
                dup(rng_0,4) = DUDT_P(4)
                up(:,1) = up(:,1) + dt6*dup(:,1) + dt3*dup(:,2) + dt3*dup(:,3) + dt6*dup(:,4) ! Full/actual time stepped solution (full RK4 solution)

                dun(rng_0,4) = DUDT_N(4)
                un(:,1) = un(:,1) + dt6*dun(:,1) + dt3*dun(:,2) + dt3*dun(:,3) + dt6*dun(:,4) ! Full/actual time stepped solution (full RK4 solution)

        end do

        saved(:,ii,1) = up(:,1) ! Save solution
        saved(:,ii,2) = un(:,1)
end do

Now, since each vector is 20 components large, there are a lot of arithmetic operations needed per temporary time step. You can see this by looking at the derivative DUDT_P() and DUDT_N() (macro) functions, one for each vector, and notice that they involve element-wise arithmetic in the rng_* entries between the vectors (here all rng_* lists specify various of the 20 elements in the up and un vectors).

There is one important constraint, which I have also commented in the code. Since the solution at each following temporary time step depends on the previous, the threads MUST have the same updated view on the up and un vectors before calculating the next temp. time step solution (i.e. shared).

For clarity on the variables, this is my used notation:

d(:,:) are constant (parameter) interaction coefs.
up(1:20,1) and un(1:20,1) are the presently known actual solutions
up(1:20,2) and un(1:20,2) are the first temp. time step solutions
up(1:20,3) and un(1:20,3) are the second temp. time step solutions
up(1:20,4) and un(1:20,4) are the third temp. time step solutions
The structure of the real and temp. solutions need not necessarily be in the up(1:20,1:4) format I have chosen here.
The memory view on the dup and dun rate of change variables must be consistent at the point during step 4/4, where they are used to calculate the final/real time step solution.

I have thought of a couple ways to tackle this, however, I'm stuck in both:

At each of the four temp. steps, one thread could calculate DUDT_P() and the other DUDT_N() by making two sections for each temp step. However, when trying this there is a small performance loss because, I'm guessing, that dup and dun (like up and un) need to be shared vars since there is no way of specifying which thread is to execute which of the two sections? I.e. the cached dup and dun become invalid unless the same thread can somehow be forced to only do the calculations at each temp step for the same vector.
Somehow establish a work-sharing situation or using a reduction clause at each temp step, however, I have difficulty understanding how to set this up.

Thanks for any comment you may have!

Solution

Single-core optimization

You have a very good ADD/MUL balance. On this kind of CPU-bound code, ifort or PGI may give a better performance than gfortran. Are you using such a compiler? Have you activated the best compiler options for your CPU (-xAVX, -xCORE-AVX2, -xSSE4.2, etc) ? This could make a lot of difference on your particular example.
If you put explicitly the loop boundaries : do k=1,20, the compiler will know what to best for the loops.
You could also play on memory alignment. With the Intel compiler, you can put the directive: !DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un after the declaration of your variables. It is also possible with the PGI compiler. This will improve the memory access, and the vectorization possibilities of the compiler, as you can only vectorize with data aligned on a 32 byte boundary (or 64 bytes on Xeon Phi, and 16 bytes on older CPUs).
As your first dimension is 20, all the columns of your array are also 32-byte aligned. You can tell this to the compiler using the directive !DIR$ VECTOR ALIGNED before the inner-most loop.
Try -O2 instead of -O3, as -O3 is sometimes slower.
Try the flush-to-zero compiler option : working with denormalized numbers slows down the execution

OpenMP

If you want to use one thread for up and one thread for un, you could try the following, but I'm really not sure you will gain (or event not lose...) because of the openMP barriers.

Thread 0 is doing up and thread 1 is doing un. It is important to put the allocate statements and initialization inside the openMP sections so that the memory is allocated as close as possible to the threads (first-touch policy).

Here your code modified with all those ideas:

!DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un

#define DUDT_P(STEP)     cmplx(0,1)*(   d(rng_0,1)*CONJG(un(rng_p1,STEP))*un(rng_p2,STEP) + \
                                        d(rng_0,2)*CONJG(un(rng_m1,STEP))*up(rng_p1,STEP) + \
                                        d(rng_0,3)*      un(rng_m2,STEP) *up(rng_m1,STEP) )

#define DUDT_N(STEP)     cmplx(0,1)*(   d(rng_0,1)*CONJG(up(rng_p1,STEP))*up(rng_p2,STEP) + \
                                        d(rng_0,2)*CONJG(up(rng_m1,STEP))*un(rng_p1,STEP) + \
                                        d(rng_0,3)*      up(rng_m2,STEP) *un(rng_m1,STEP) )

[...]

!$OMP PARALLEL SHARED( up,un,dup,dun, [...] ) PRIVATE( [...] ) num_threads(2)
ithread = omp_get_thread_num()

!DIR$ ATTRIBUTES ALIGN : 32 :: dup, up, dun, un
if (ithread == 0) then
  allocate (up(1:20,4), dup(1:20,4))
  ! Initialization is important to pin the memory to the cores
  up = 0.
  dun = 0.
else
  allocate (un(1:20,4), dun(1:20,4))
  ! Initialization is important to pin the memory to the cores
  un = 0.
  dun = 0.
endif

do ii = 2, nt+1
        do jj = 1,nti

            !$OMP BARRIER
                ! ---------- STEP 1/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
            if (ithread == 0) then
                dup(rng_0,1) = DUDT_P(1)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  up(k,2) = up(k,1) + dt2*dup(k,1)
                end do
            else
                dun(rng_0,1) = DUDT_N(1)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  un(k,2) = un(k,1) + dt2*dun(k,1)
                end do
            endif

            !$OMP BARRIER
                ! ---------- STEP 2/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
            if (ithread == 0) then
                dup(rng_0,2) = DUDT_P(2)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  up(k,3) = up(k,1) + dt2*dup(k,2)
                end do
            else
                dun(rng_0,2) = DUDT_N(2)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  un(k,3) = un(k,1) + dt2*dun(k,2)
                end do
            endif

            !$OMP BARRIER
                ! ---------- STEP 3/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
            if (ithread == 0) then
                dup(rng_0,3) = DUDT_P(3)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  up(k,4) = up(k,1) + dt*dup(k,3)
                end do
            else
                dun(rng_0,3) = DUDT_N(3)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  un(k,4) = un(k,1) + dt*dun(k,3)
                end do
            endif

            !$OMP BARRIER
                ! ---------- STEP 4/4 ---------- THERE MUST BE A BARRIER/CONSISTENT MEMORY VIEW HERE
            if (ithread == 0) then
                dup(rng_0,4) = DUDT_P(4)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  up(k,1) = up(k,1) + dt6*dup(k,1) + dt3*dup(k,2) + dt3*dup(k,3) + dt6*dup(k,4) ! Full/actual time stepped solution (full RK4 solution)
                enddo
            else
                dun(rng_0,4) = DUDT_N(4)
                !DIR$ VECTOR ALIGNED
                do k=1,20
                  un(k,1) = un(k,1) + dt6*dun(k,1) + dt3*dun(k,2) + dt3*dun(k,3) + dt6*dun(k,4) ! Full/actual time stepped solution (full RK4 solution)
                enddo
            endif

        end do

        if (ithread == 0) then

          do k=1,20
            saved(k,ii,1) = up(k,1) ! Save solution
          end do

        else

          do k=1,20
            saved(k,ii,2) = un(k,1) ! Save solution
          end do

        end if

end do

if (ithread == 0) then
  deallocate(un,up,dun,dup)
end if

!$OMP END PARALLEL