I am trying to use FFTW with OpenMP and Fortran, but I get wrong results when executing in parallel. The results also change from run to run, which is the typical behaviour when parallelisation goes wrong.
I am aiming for a simple 3D real-to-complex transformation. Following the FFTW tutorial, I took everything except the call to dfftw_execute_dft_r2c() out of the parallel region, but it doesn't seem to work.
I use FFTW 3.3.8, configured with

./configure --enable-threads --enable-openmp --enable-mpi

and compile my code with

gfortran program.f03 -o program.o -I/usr/include -fopenmp -lfftw3_omp -lfftw3 -g -Wall
This is what my program looks like:
program use_fftw

  use, intrinsic :: iso_c_binding
  use omp_lib
  implicit none
  include 'fftw3.f03'

  integer, parameter  :: dp = kind(1.0d0)
  integer, parameter  :: Nx = 10
  integer, parameter  :: Ny = 5
  integer, parameter  :: Nz = 5
  real(dp), parameter :: pi = 3.1415926d0
  real(dp), parameter :: physical_length_x = 20.d0
  real(dp), parameter :: physical_length_y = 10.d0
  real(dp), parameter :: physical_length_z = 10.d0
  real(dp), parameter :: lambda1 = 0.5d0
  real(dp), parameter :: lambda2 = 0.7d0
  real(dp), parameter :: lambda3 = 0.9d0
  real(dp), parameter :: dx = physical_length_x/real(Nx,dp)
  real(dp), parameter :: dy = physical_length_y/real(Ny,dp)
  real(dp), parameter :: dz = physical_length_z/real(Nz,dp)

  integer  :: void, nthreads
  integer  :: i, j, k
  real(dp) :: d

  complex(dp), allocatable, dimension(:,:,:) :: arr_out
  real(dp),    allocatable, dimension(:,:,:) :: arr_in
  integer*8 :: plan_forward

  allocate(arr_in( 1:Nx,     1:Ny, 1:Nz)); arr_in  = 0
  allocate(arr_out(1:Nx/2+1, 1:Ny, 1:Nz)); arr_out = 0

  !------------------------------
  ! Initialize FFTW
  !------------------------------

  ! Must be called before any other FFTW routine, outside of parallel regions
  void = fftw_init_threads()
  if (void == 0) then
    write(*,*) "Error in fftw_init_threads, quitting"
    stop
  endif

  nthreads = omp_get_num_threads()
  call fftw_plan_with_nthreads(nthreads)

  ! Plan execution is thread-safe, but plan creation and destruction are not:
  ! create/destroy plans only from a single thread
  call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)

  !--------------------------------
  ! Start parallel region
  !--------------------------------

  !$OMP PARALLEL PRIVATE(j, k, d)

    ! Fill array with a wave.
    ! NOTE: the wave only depends on x so you can plot it later.
    !$OMP DO
    do i = 1, Nx
      d = 2.0*pi*i*dx
      do j = 1, Ny
        do k = 1, Nz
          arr_in(i,j,k) = cos(d/lambda1) + sin(d/lambda2) + cos(d/lambda3)
        enddo
      enddo
    enddo
    !$OMP END DO

    call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)

  !$OMP END PARALLEL

  !-----------------
  ! Print results
  !-----------------

  do i = 1, Nx/2+1
    do j = 1, Ny
      do k = 1, Nz
        write(*,'(F12.6,A3,F12.6,A3)', advance='no') real(arr_out(i,j,k)), " , ", aimag(arr_out(i,j,k)), " ||"
      enddo
      write(*,*)
    enddo
    write(*,*)
  enddo

  deallocate(arr_in, arr_out)

  ! Destroying plans is not thread-safe; do it from a single thread only
  call dfftw_destroy_plan(plan_forward)

end program use_fftw
I also tried moving the FFTW initialisation (void = fftw_init_threads(), call fftw_plan_with_nthreads(nthreads) and call dfftw_plan_dft_r2c_3d(...)) into the parallel region, wrapping it in a !$OMP SINGLE block and synchronising with a barrier afterwards, but the situation didn't improve.
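That attempt looked roughly like this (a sketch from memory, reusing the names from the program above):

!$OMP PARALLEL PRIVATE(j, k, d)

  !$OMP SINGLE
    void = fftw_init_threads()
    call fftw_plan_with_nthreads(omp_get_num_threads())
    call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)
  !$OMP END SINGLE
  !$OMP BARRIER

  ! ... same filling loop and execute call as in the program above ...

!$OMP END PARALLEL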
Can anyone help me?
EDIT: I was able to test my program on another system and the problem remains, so the issue apparently isn't with the OpenMP or FFTW installation, but somewhere in the program itself.
You should normally call the FFTW execute routines outside of any parallel region. They contain their own parallel regions internally and will take care of running the transform in parallel with as many threads as you requested during planning, reusing your existing OpenMP threads.
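For the program in the question, that means keeping the filling loop parallel but hoisting the execute call out of the parallel region. A minimal sketch of the corrected flow (note it uses omp_get_max_threads(), since omp_get_num_threads() always returns 1 when called outside a parallel region):

void = fftw_init_threads()
if (void == 0) stop "Error in fftw_init_threads"

! Plan from a single thread, requesting all available threads
call fftw_plan_with_nthreads(omp_get_max_threads())
call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)

! Fill the input array with your own parallel region
!$OMP PARALLEL DO PRIVATE(j, k, d)
do i = 1, Nx
  d = 2.0*pi*i*dx
  do j = 1, Ny
    do k = 1, Nz
      arr_in(i,j,k) = cos(d/lambda1) + sin(d/lambda2) + cos(d/lambda3)
    enddo
  enddo
enddo
!$OMP END PARALLEL DO

! Execute OUTSIDE any parallel region: FFTW opens its own one internally
call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)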
You can also call them inside a parallel region, but then each call must work on different arrays, not on the same ones, and the plan should be created for 1 thread. Each thread could then perform, for example, a 2D transform of its own slice of the array, as in the sketch below.
That is what thread-safety means here: you may call the execute routines concurrently, but each call must operate on different data.
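A minimal sketch of that pattern, assuming a separate plan handle plan_slice (my name, an integer*8 like plan_forward) and the new-array execute interface; per the FFTW manual, arrays passed to new-array execute should have the same alignment as the ones used at planning time:

! Create ONE single-threaded 2D plan, from a single thread
call fftw_plan_with_nthreads(1)
call dfftw_plan_dft_r2c_2d(plan_slice, Nx, Ny, arr_in(:,:,1), arr_out(:,:,1), FFTW_ESTIMATE)

! Each thread applies the same plan to its own z-slice through the
! new-array execute interface; this is safe because every call
! touches different data
!$OMP PARALLEL DO
do k = 1, Nz
  call dfftw_execute_dft_r2c(plan_slice, arr_in(:,:,k), arr_out(:,:,k))
enddo
!$OMP END PARALLEL DO

Note that each 2D transform here only covers an x-y plane; the z direction stays untransformed, so this is not a drop-in replacement for the full 3D transform, just an illustration of the pattern.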