Tags: r, foreach, parallel.foreach

R: how to use foreach for a pre-specified number of replicates like in a while loop


library(foreach)
library(doMC)


myfun <- function(threshold){
  val <- rnorm(1, mean = 0, sd = 1)
  if(val > threshold){
    stop("bad")
  }else return(val)
}

results <- vector("list", length = 10)
parallel_fun <- function(reps, threshold){
  registerDoMC(cores = 48)
  results = foreach (j = 1:reps, .combine = rbind) %dopar% {
    myfun(threshold)
  }
}

> parallel_fun(reps = 10, threshold = 0)
 Error in { : task 1 failed - "bad" 

The above is a simple, reproducible example. I want to parallelize myfun for a total of reps = 10 replicates. myfun may stop if the val that was generated is greater than some threshold; in that case, I want to discard that run rather than have it return val. In the end, I want results to hold 10 vals that do not exceed the threshold. I therefore thought a while loop might be more appropriate, since I want to keep running until I have 10 values that satisfy the threshold. Is it possible to repurpose foreach to parallelize a while loop?


Solution

  • Control flow

    Using exceptions for control flow is often discouraged. Ideally, use one of the approaches below instead.

    Use a function that already does what you want

    In this specific example, you are sampling from a truncated normal distribution, so you could use the rtruncnorm function from the truncnorm package.
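
    A minimal sketch of this idea, assuming the goal is 10 standard-normal values that do not exceed the threshold. The variable names n and vals are only for illustration, and the base-R branch is an alternative technique (inverse-CDF sampling) added here so the sketch still runs when truncnorm is not installed:

    ```r
    threshold <- 0
    n <- 10

    if (requireNamespace("truncnorm", quietly = TRUE)) {
      # a and b are the lower and upper truncation bounds
      vals <- truncnorm::rtruncnorm(n, a = -Inf, b = threshold, mean = 0, sd = 1)
    } else {
      # Fallback, not from the original answer: draw uniforms on
      # (0, pnorm(threshold)) and map them back through the normal quantile function
      vals <- qnorm(runif(n, min = 0, max = pnorm(threshold)))
    }
    ```

    Either branch produces all n values in one vectorized call, with no rejection loop, so no parallelization is needed for a case this simple.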

    Rewrite the function

    Alternatively, rewrite myfun so that it always returns a valid value:

    myfun = function(threshold){
        repeat{
            val = rnorm(1, 0, 1)
            if(val <= threshold)
                break
            }
        val 
        }
    

    This is just one possible variant; here, repeat with break acts as a do-while loop.

    Note that, depending on the threshold, the loop may need a very large or even effectively unbounded number of iterations. Tread carefully: cap the number of iterations, check beforehand that the threshold is within a range the generator can realistically reach, or ideally do both.
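
    As a rough illustration of such a preliminary check: for the standard normal used here, the chance that a single draw is accepted is pnorm(threshold), so the expected number of draws per accepted value is its reciprocal. The helper name check_threshold and the cutoff min_prob are hypothetical, chosen only for this sketch:

    ```r
    # Hypothetical helper: refuse thresholds that would make the
    # rejection loop impractically slow.
    check_threshold <- function(threshold, min_prob = 1e-6) {
      p_accept <- pnorm(threshold, mean = 0, sd = 1)  # P(val <= threshold) for one draw
      if (p_accept < min_prob) {
        stop("threshold is too low: acceptance probability is ", signif(p_accept, 3))
      }
      # Expected number of draws per accepted value (mean of a geometric distribution)
      1 / p_accept
    }

    expected_draws <- check_threshold(0)  # half of all draws are accepted
    ```

    A check like this only works when the acceptance probability can be computed up front; for a black-box generator, an iteration cap is the safer guard.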

    With myfun rewritten this way, you should be able to run foreach just as you are doing right now.
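
    For instance, the question's parallel_fun could be adapted roughly as follows. This is a sketch under a few assumptions: cores = 2 is an arbitrary choice, .combine = c is used instead of rbind so that the result is a plain numeric vector, and the function returns the foreach result directly instead of assigning it to a local variable:

    ```r
    library(foreach)
    library(doMC)  # as in the question; relies on forking, so not available on Windows

    # Rewritten myfun: keeps drawing until the value does not exceed the threshold
    myfun <- function(threshold) {
      repeat {
        val <- rnorm(1, mean = 0, sd = 1)
        if (val <= threshold) break
      }
      val
    }

    parallel_fun <- function(reps, threshold, cores = 2) {
      registerDoMC(cores = cores)
      # .combine = c collects the scalar results into one numeric vector
      foreach(j = seq_len(reps), .combine = c) %dopar% {
        myfun(threshold)
      }
    }

    results <- parallel_fun(reps = 10, threshold = 0)
    ```

    Because every task now returns a value, no task fails, and results always has exactly reps elements.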

    Write a wrapper

    If you don't have control over myfun, you need to write a wrapper. Its structure can be almost identical to the function above:

    wrap_myfun = function(threshold){
        repeat{
            val = try(myfun(threshold), silent = TRUE)  # silent: do not print the error on every retry
            if(is.numeric(val))
                break
            }
        val
        }
    

    Keeping track of iterations:

    If you need to know how many iterations it took to generate each number, you can rewrite the repeat as a for loop, or simply add a counter together with options for a maximum number of iterations and a default value:

    wrap_myfun = function(threshold, .maxiter=10^9, .default=NA){
        iter = 1
        repeat{
            val = try(myfun(threshold), silent = TRUE)  # silent: do not print the error on every retry
            if(is.numeric(val))
                break
    
            if(iter >= .maxiter){
                val = .default 
                break
                }
    
            iter = iter + 1
            }
        list("value"=val, "iterations"=iter)
        }
    

    Alternatively, instead of assigning a default value, you can call `stop("maximum iterations reached")`. Which one to use depends on how serious the problem is.
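
    A sketch of that stricter variant, written as a for loop. The name wrap_myfun_strict and the smaller default cap are assumptions made for illustration; myfun is the unmodified function from the question:

    ```r
    # The original function from the question: fails when the draw exceeds the threshold
    myfun <- function(threshold) {
      val <- rnorm(1, mean = 0, sd = 1)
      if (val > threshold) stop("bad")
      val
    }

    # Hypothetical strict wrapper: raises an error instead of returning a default
    wrap_myfun_strict <- function(threshold, .maxiter = 1000) {
      for (iter in seq_len(.maxiter)) {
        val <- try(myfun(threshold), silent = TRUE)
        if (is.numeric(val)) {
          return(list(value = val, iterations = iter))
        }
      }
      stop("maximum iterations reached")
    }

    res <- wrap_myfun_strict(threshold = 0)
    ```

    With an unreachable threshold such as -Inf, every call to myfun fails, so the wrapper gives up after .maxiter attempts and signals the error.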

    This way, all of the logic lives in the data-generating function, and you do not have to manage the queues implemented in foreach. The load should be distributed roughly equally among the cores, apart from the randomly long computation time of some iterations, which you cannot influence.
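
    Putting the pieces together, a run with the counting wrapper might look like the sketch below. The choices of cores = 2, a smaller .maxiter, and NA_real_ as the default are assumptions made for this example, as are the names vals and iters:

    ```r
    library(foreach)
    library(doMC)  # as in the question; relies on forking, so not available on Windows

    # The original function, left untouched
    myfun <- function(threshold) {
      val <- rnorm(1, mean = 0, sd = 1)
      if (val > threshold) stop("bad")
      val
    }

    # Counting wrapper, with a smaller iteration cap for this sketch
    wrap_myfun <- function(threshold, .maxiter = 10^4, .default = NA_real_) {
      iter <- 1
      repeat {
        val <- try(myfun(threshold), silent = TRUE)
        if (is.numeric(val)) break
        if (iter >= .maxiter) {
          val <- .default
          break
        }
        iter <- iter + 1
      }
      list(value = val, iterations = iter)
    }

    registerDoMC(cores = 2)
    # Without .combine, foreach returns a list with one element per replicate
    results <- foreach(j = 1:10) %dopar% wrap_myfun(threshold = 0)

    vals  <- vapply(results, function(r) r$value, numeric(1))
    iters <- vapply(results, function(r) r$iterations, numeric(1))
    ```

    Each of the 10 tasks retries on its own worker until it succeeds, so foreach itself only ever sees a fixed number of replicates.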