Search code examples
multithreadingfortranopenmpbenchmarking

Code takes much more time to finish with more than 1 thread


I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.

!Kompileraufruf: gfortran -fopenmp -o minExample.x minExample.f90

  PROGRAM minExample
     USE omp_lib
     IMPLICIT NONE
     INTEGER                        :: n_chars, real_alloced
     INTEGER                        :: nx,ny,nz,ix,iy,iz, idx
     INTEGER                        :: nthreads, lasteinstellung,i 
     INTEGER, PARAMETER             :: dp = kind(1.0d0)
     REAL (KIND = dp)               :: j
     CHARACTER(LEN=32)              :: arg

     nx             = 2
     ny             = 2
     nz             = 2
     lasteinstellung= 10000
     CALL getarg(1, arg)
     READ(arg,*) nthreads
     CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
     nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
     WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"

    n_chars = 0
    idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz, idx) &
!$omp   private(lasteinstellung,j) !&  
    DO iz=-nz,nz
       DO iy=-ny,ny
          DO ix=-nx,nx
!                  WRITE(*,*) ix,iy,iz
             j = 0.0d0
             DO i=1,lasteinstellung
                j = j + real(i)
             END DO
!$omp critical
             n_chars = n_chars + 1               
            idx = n_chars                       
!$omp end critical
          END DO
       END DO
    END DO
  END PROGRAM

I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine. What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?

edit: strange behaviour: if I have the write in the nested loops, the execution speeds dramatically up with 2 threads. If its commented out, the execution with 2 or 3 threads takes forever (write shows very slow incrementation of loop variables)...but not with 1 or 4 threads. I tried this code also on another multicore machine. There it takes for 1 and 3 threads forever but not for 2 or 4 threads.


Solution

  • If the code you are showing is really complete you are missing definition of loadSet in the parallel section in which it is private. It is undefined and loop

                     DO i=1,loadSet
                        j = j + real(i)
                     END DO
    

    can take a completely arbitrary number of iterations.

    If the value is defined somewhere before in the code you do not show you probably want firstprivate instead of private.