I am writing a recursive subroutine in Fortran that expands as a binary tree (i.e. the procedure calls itself twice until it reaches the end of a branch). The general algorithmic logic is:
'''
subroutine my_subroutine(input, output)
    use input to generate possible new_input(:,1) and new_input(:,2)
    do i = 1, 2
        call my_subroutine(new_input(:,i), new_output(i))
    end do
    output = best(new_output(1), new_output(2))
end subroutine
'''
In principle, this could be substantially accelerated through parallel computing; however, when I use OpenMP to parallelize the loop, the resulting executable aborts with the error:
libgomp: Thread creation failed: Resource temporarily unavailable Thread creation failed: Resource temporarily unavailable
I'm guessing that the stack requirements grow too large for the system limits, but I haven't found a resolution or workaround. Are there ways I can use parallel computing to improve the performance of this kind of algorithm?
- Does OpenMP or gfortran have options to help avoid these issues?
- Would it help to parallelize only above or below a certain level in the tree?
- Would C or C++ be a better option for this application?
I am working on macOS Catalina. Stack size is hard capped at 65532. My environment variables are:
OMP_NESTED=True
OMP_DYNAMIC=True
That sounds more like your code is creating too many threads due to very deep recursion. There are ways to mitigate it. For example, OpenMP 4.5 introduced the concept of maximum active levels, controlled by the max-active-levels-var ICV (internal control variable). You may set its value either through the OMP_MAX_ACTIVE_LEVELS environment variable or by calling omp_set_max_active_levels(). Once the level of nesting reaches the value of max-active-levels-var, parallel regions nested further are deactivated, i.e., they execute sequentially without spawning new threads.
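For example, taking the value 4 purely as an illustration (the right number is system-dependent; see the note on tuning below), you could set the limit before running the program:

export OMP_MAX_ACTIVE_LEVELS=4

or, equivalently, make the call from inside the program before the first parallel region:

use omp_lib
! Cap nested parallelism at four active levels (4 is an illustrative value)
call omp_set_max_active_levels(4)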
If your compiler does not support OpenMP 4.5, or if you want your code to be backward compatible with older compilers, you can do it manually by tracking the level of nesting and deactivating the parallel region yourself. For the latter, there is the if(b) clause: when applied to the parallel region, it makes the region active only when b evaluates to .true., and inactive otherwise. A sample parallel implementation of your code:
subroutine my_subroutine(inputs, output, level)
    use input to generate possible new_input(:,1) and new_input(:,2)
    !$omp parallel do schedule(static,1) if(level < max_levels)
    do i = 1, 2
        call my_subroutine(new_input(:,i), new_output(i), level+1)
    end do
    !$omp end parallel do
    output = best(new_output(1), new_output(2))
end subroutine my_subroutine
The top-level call to my_subroutine has to be made with level equal to 0.
No matter how exactly you implement it, you'll need to experiment with the value of the maximum level. The optimal value will depend on the number of CPUs/cores and the arithmetic intensity of the code and will vary from system to system.
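If you need a starting point for that experimentation, one possible heuristic (my suggestion, not an OpenMP rule) is to pick just enough levels of two-way forking to cover the available threads:

use omp_lib
integer :: max_levels
! Two-way forking yields up to 2**level busy threads at a given depth,
! so take enough levels to occupy all threads, plus one level of slack.
max_levels = ceiling(log(real(omp_get_max_threads())) / log(2.0)) + 1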
A better alternative to the parallel do construct is to use OpenMP tasks, again with a cut-off at a certain level of nesting. The good thing about tasks is that you can fix the number of OpenMP threads in advance and the tasking runtime will take care of workload distribution.
subroutine my_subroutine(inputs, output, level)
    use input to generate possible new_input(:,1) and new_input(:,2)
    !$omp taskloop shared(new_input, new_output) final(level >= max_levels)
    do i = 1, 2
        call my_subroutine(new_input(:,i), new_output(i), level+1)
    end do
    !$omp end taskloop
    output = best(new_output(1), new_output(2))
end subroutine my_subroutine
Here, each iteration of the loop becomes a separate task. Once max_levels of nesting has been reached, the tasks become final, which means they will not be deferred (i.e., they will execute sequentially) and each nested task will be final too, effectively stopping parallel execution further down the recursion tree. Task loops are a convenience feature introduced in OpenMP 4.5. With earlier compilers, the following equivalent code will do:
subroutine my_subroutine(inputs, output, level)
    use input to generate possible new_input(:,1) and new_input(:,2)
    do i = 1, 2
        !$omp task shared(new_input, new_output) final(level >= max_levels)
        call my_subroutine(new_input(:,i), new_output(i), level+1)
        !$omp end task
    end do
    !$omp taskwait
    output = best(new_output(1), new_output(2))
end subroutine my_subroutine
There are no parallel constructs in the tasking code. Instead, you need to call my_subroutine from within a parallel region, and the idiomatic way is to do it like this:
!$omp parallel
!$omp single
call my_subroutine(inputs, output, 0)
!$omp end single
!$omp end parallel
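Putting the tasking pieces together, here is a minimal compilable sketch. The input expansion and best() are toy stand-ins I made up (bisecting an interval and maximising sin at the leaves) so that the program runs end to end; substitute your actual logic:

module tree_search
    implicit none
    integer, parameter :: max_levels = 4   ! cut-off depth; tune per system
contains
    recursive subroutine my_subroutine(input, output, level)
        real, intent(in)    :: input(2)    ! toy input: an interval [a,b]
        real, intent(out)   :: output
        integer, intent(in) :: level
        real    :: new_input(2,2), new_output(2), mid
        integer :: i

        ! Leaf: stop splitting once the interval is small enough
        if (input(2) - input(1) < 1.0e-3) then
            output = sin(input(1))         ! toy objective function
            return
        end if

        ! Generate the two child inputs by bisection
        mid = 0.5 * (input(1) + input(2))
        new_input(:,1) = [input(1), mid]
        new_input(:,2) = [mid, input(2)]

        do i = 1, 2
            !$omp task shared(new_input, new_output) firstprivate(i) final(level >= max_levels)
            call my_subroutine(new_input(:,i), new_output(i), level+1)
            !$omp end task
        end do
        !$omp taskwait

        output = max(new_output(1), new_output(2))   ! toy best()
    end subroutine my_subroutine
end module tree_search

program main
    use tree_search
    implicit none
    real :: output

    !$omp parallel
    !$omp single
    call my_subroutine([0.0, 3.14159], output, 0)
    !$omp end single
    !$omp end parallel

    print *, 'best found:', output
end program main

Compile with gfortran -fopenmp. OMP_NUM_THREADS controls the size of the thread team, and the final() cut-off keeps the number of deferred tasks bounded no matter how deep the recursion goes.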
There is a fundamental difference between the nested parallel version and the one using tasks. In the former case, at each recursive level the current thread forks in two and each thread does one half of the computation in parallel. Limiting the level of active parallelism is needed here to prevent the runtime from spawning too many threads and exhausting the system resources.

In the latter case, at each recursive level two new tasks are created and deferred for later, possibly parallel, execution by the team of threads associated with the parallel region. The number of threads stays the same, and the cut-off here limits the build-up of tasking overhead, which is way smaller than the overhead of spawning new parallel regions. Hence, the optimal value of max_levels for the tasking code will differ significantly from the optimal value for the nested parallel code.