multithreading fortran openmp benchmarking

Code takes much more time to finish with more than 1 thread

I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.

!Kompileraufruf: gfortran -fopenmp -o minExample.x minExample.f90

  PROGRAM minExample
     USE omp_lib
     IMPLICIT NONE
     INTEGER                        :: n_chars, real_alloced
     INTEGER                        :: nx,ny,nz,ix,iy,iz, idx
     INTEGER                        :: nthreads, lasteinstellung,i 
     INTEGER, PARAMETER             :: dp = kind(1.0d0)
     REAL (KIND = dp)               :: j
     CHARACTER(LEN=32)              :: arg

     nx             = 2
     ny             = 2
     nz             = 2
     lasteinstellung= 10000
     CALL getarg(1, arg)
     READ(arg,*) nthreads
     CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
     nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
     WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"

    n_chars = 0
    idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz, idx) &
!$omp   private(lasteinstellung,j) !&  
    DO iz=-nz,nz
       DO iy=-ny,ny
          DO ix=-nx,nx
!                  WRITE(*,*) ix,iy,iz
             j = 0.0d0
             DO i=1,lasteinstellung
                j = j + real(i)
             END DO
!$omp critical
             n_chars = n_chars + 1               
            idx = n_chars                       
!$omp end critical
          END DO
       END DO
    END DO
  END PROGRAM

I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine. What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?

edit: strange behaviour: if I have the write in the nested loops, the execution speeds dramatically up with 2 threads. If its commented out, the execution with 2 or 3 threads takes forever (write shows very slow incrementation of loop variables)...but not with 1 or 4 threads. I tried this code also on another multicore machine. There it takes for 1 and 3 threads forever but not for 2 or 4 threads.

Solution

If the code you are showing is really complete you are missing definition of loadSet in the parallel section in which it is private. It is undefined and loop

                 DO i=1,loadSet
                    j = j + real(i)
                 END DO

can take a completely arbitrary number of iterations.

If the value is defined somewhere before in the code you do not show you probably want firstprivate instead of private.