I am trying to use FFTW with OpenMP and Fortran, but I get wrong results when executing in parallel. The results also change from run to run, which is the typical behaviour when parallelisation goes wrong.
I am aiming for a simple 3D real-to-complex transformation. Following the FFTW tutorial, I took everything except the call to dfftw_execute_dft_r2c() out of the parallel region, but it doesn't seem to work.
I use FFTW 3.3.8, configured with

./configure --enable-threads --enable-openmp --enable-mpi

and compile my code with

gfortran program.f03 -o program.o -I/usr/include -fopenmp -lfftw3_omp -lfftw3 -g -Wall
This is what my program looks like:
program use_fftw

  use, intrinsic :: iso_c_binding
  use omp_lib
  implicit none
  include 'fftw3.f03'

  integer, parameter  :: dp = kind(1.0d0)
  integer, parameter  :: Nx = 10
  integer, parameter  :: Ny = 5
  integer, parameter  :: Nz = 5
  real(dp), parameter :: pi = 3.1415926d0
  real(dp), parameter :: physical_length_x = 20.d0
  real(dp), parameter :: physical_length_y = 10.d0
  real(dp), parameter :: physical_length_z = 10.d0
  real(dp), parameter :: lambda1 = 0.5d0
  real(dp), parameter :: lambda2 = 0.7d0
  real(dp), parameter :: lambda3 = 0.9d0
  real(dp), parameter :: dx = physical_length_x/real(Nx,dp)
  real(dp), parameter :: dy = physical_length_y/real(Ny,dp)
  real(dp), parameter :: dz = physical_length_z/real(Nz,dp)

  integer  :: void, nthreads
  integer  :: i, j, k
  real(dp) :: d

  complex(dp), allocatable, dimension(:,:,:) :: arr_out
  real(dp),    allocatable, dimension(:,:,:) :: arr_in
  integer*8 :: plan_forward

  allocate(arr_in( 1:Nx,     1:Ny, 1:Nz)); arr_in  = 0
  allocate(arr_out(1:Nx/2+1, 1:Ny, 1:Nz)); arr_out = 0

  !------------------------------
  ! Initialize FFTW
  !------------------------------

  ! Must be called before any other FFTW routine, outside of parallel regions
  void = fftw_init_threads()
  if (void == 0) then
    write(*,*) "Error in fftw_init_threads, quitting"
    stop
  endif

  nthreads = omp_get_num_threads()
  call fftw_plan_with_nthreads(nthreads)

  ! Plan execution is thread-safe, but plan creation and destruction are not:
  ! create/destroy plans only from a single thread
  call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)

  !--------------------------------
  ! Start parallel region
  !--------------------------------

  !$OMP PARALLEL PRIVATE(j, k, d)

    ! Fill array with a wave.
    ! NOTE: the wave only depends on x so you can plot it later.
    !$OMP DO
    do i = 1, Nx
      d = 2.0*pi*i*dx
      do j = 1, Ny
        do k = 1, Nz
          arr_in(i,j,k) = cos(d/lambda1) + sin(d/lambda2) + cos(d/lambda3)
        enddo
      enddo
    enddo
    !$OMP END DO

    call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)

  !$OMP END PARALLEL

  !-----------------
  ! Print results
  !-----------------

  do i = 1, Nx/2+1
    do j = 1, Ny
      do k = 1, Nz
        write(*,'(F12.6,A3,F12.6,A3)', advance='no') real(arr_out(i,j,k)), " , ", aimag(arr_out(i,j,k)), " ||"
      enddo
      write(*,*)
    enddo
    write(*,*)
  enddo

  deallocate(arr_in, arr_out)

  ! Destroying plans is not thread-safe; do it from a single thread only
  call dfftw_destroy_plan(plan_forward)

end program use_fftw
I also tried moving the FFTW initialisation (void = fftw_init_threads(), call fftw_plan_with_nthreads(nthreads) and call dfftw_plan_dft_r2c_3d(...)) into the parallel region, wrapping it in a !$OMP SINGLE block and synchronising with a barrier afterwards, but the situation didn't improve.
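That attempt looked roughly like this (a sketch from memory, reusing the names from the program above):

!$OMP PARALLEL PRIVATE(j, k, d)

  !$OMP SINGLE
    void = fftw_init_threads()
    call fftw_plan_with_nthreads(omp_get_num_threads())
    call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)
  !$OMP END SINGLE
  !$OMP BARRIER

  ! ... same filling loop and execute call as in the program above ...

!$OMP END PARALLEL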
Can anyone help me?
EDIT: I was able to test my program on another system and the problem remains, so the issue apparently isn't with the OpenMP or FFTW installation, but somewhere in the program itself.
You should normally call the FFTW execute routines outside of any parallel region. They contain their own parallel regions internally and will take care of running the transform in parallel with as many threads as you requested during planning, reusing your existing OpenMP threads.
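For the program in the question, that means keeping the filling loop parallel but hoisting the execute call out of the parallel region. A minimal sketch of the corrected flow (note it uses omp_get_max_threads(), since omp_get_num_threads() always returns 1 when called outside a parallel region):

void = fftw_init_threads()
if (void == 0) stop "Error in fftw_init_threads"

! Plan from a single thread, requesting all available threads
call fftw_plan_with_nthreads(omp_get_max_threads())
call dfftw_plan_dft_r2c_3d(plan_forward, Nx, Ny, Nz, arr_in, arr_out, FFTW_ESTIMATE)

! Fill the input array with your own parallel region
!$OMP PARALLEL DO PRIVATE(j, k, d)
do i = 1, Nx
  d = 2.0*pi*i*dx
  do j = 1, Ny
    do k = 1, Nz
      arr_in(i,j,k) = cos(d/lambda1) + sin(d/lambda2) + cos(d/lambda3)
    enddo
  enddo
enddo
!$OMP END PARALLEL DO

! Execute OUTSIDE any parallel region: FFTW opens its own one internally
call dfftw_execute_dft_r2c(plan_forward, arr_in, arr_out)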
You can also call them inside a parallel region, but then each call must work on different arrays, not on the same ones, and the plan should be created for 1 thread. Each thread could then perform, for example, a 2D transform of its own slice of the array, as in the sketch below.
That is what thread-safety means here: you may call the execute routines concurrently, but each call must operate on different data.
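A minimal sketch of that pattern, assuming a separate plan handle plan_slice (my name, an integer*8 like plan_forward) and the new-array execute interface; per the FFTW manual, arrays passed to new-array execute should have the same alignment as the ones used at planning time:

! Create ONE single-threaded 2D plan, from a single thread
call fftw_plan_with_nthreads(1)
call dfftw_plan_dft_r2c_2d(plan_slice, Nx, Ny, arr_in(:,:,1), arr_out(:,:,1), FFTW_ESTIMATE)

! Each thread applies the same plan to its own z-slice through the
! new-array execute interface; this is safe because every call
! touches different data
!$OMP PARALLEL DO
do k = 1, Nz
  call dfftw_execute_dft_r2c(plan_slice, arr_in(:,:,k), arr_out(:,:,k))
enddo
!$OMP END PARALLEL DO

Note that each 2D transform here only covers an x-y plane; the z direction stays untransformed, so this is not a drop-in replacement for the full 3D transform, just an illustration of the pattern.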