I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.
!Kompileraufruf: gfortran -fopenmp -o minExample.x minExample.f90
PROGRAM minExample
USE omp_lib
IMPLICIT NONE
INTEGER :: n_chars, real_alloced
INTEGER :: nx,ny,nz,ix,iy,iz, idx
INTEGER :: nthreads, lasteinstellung,i
INTEGER, PARAMETER :: dp = kind(1.0d0)
REAL (KIND = dp) :: j
CHARACTER(LEN=32) :: arg
nx = 2
ny = 2
nz = 2
lasteinstellung= 10000
CALL getarg(1, arg)
READ(arg,*) nthreads
CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"
n_chars = 0
idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp shared(nx,ny,nz,n_chars) &
!$omp private(ix,iy,iz, idx) &
!$omp private(lasteinstellung,j) !&
DO iz=-nz,nz
DO iy=-ny,ny
DO ix=-nx,nx
! WRITE(*,*) ix,iy,iz
j = 0.0d0
DO i=1,lasteinstellung
j = j + real(i)
END DO
!$omp critical
n_chars = n_chars + 1
idx = n_chars
!$omp end critical
END DO
END DO
END DO
END PROGRAM
I compiled this code with gfortran -fopenmp -o test.x test.f90
and executed it with time ./test.x THREAD
Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine.
What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?
edit: strange behaviour: if I have the write in the nested loops, the execution speeds dramatically up with 2 threads. If its commented out, the execution with 2 or 3 threads takes forever (write shows very slow incrementation of loop variables)...but not with 1 or 4 threads. I tried this code also on another multicore machine. There it takes for 1 and 3 threads forever but not for 2 or 4 threads.
If the code you are showing is really complete you are missing definition of loadSet
in the parallel section in which it is private
. It is undefined and loop
DO i=1,loadSet
j = j + real(i)
END DO
can take a completely arbitrary number of iterations.
If the value is defined somewhere before in the code you do not show you probably want firstprivate
instead of private
.