r parallel-processing scope global doparallel

Nested maximisation in parallel with the need of using global variables in R

I've an R code with two nested optimisations. There's an outer and an inner function. The outer function passed certain parameters along to the inner function, which performs an optimisation on another set of parameters. These parameters are then sent to the outer function which optimizes an objective function based on the parameters estimated in the inner function. The estimates on the outer function are then passed to the inner function, which finds the new optimal set of parameters in the inner function, and passes them to the outer function. These loops repeat until the objective function in the outer loop is minimised.

The code works by setting the inner parameters as global variables, so that after the maximisation in the outer loop, the code passes these global variables to the inner loop.

I would like to run this procedure for different datasets in parallel. I understand that I cannot use the global variables in parallel, and I was thinking of saving text files with different file names at each loop: I would save a file with the value of the parameters at the end of the outer loop and reopen it at the beginning of the outer loop. However, is there a more efficient way to do this? I don't think that using a list would work. Thank you.

Example:

require(nloptr)
y = rnorm(100)
x = runif(100)*5

inner <- function(beta) mean((y-beta*x)^2)

outer <- function(alpha) {

  if (!exists("storage") | is.null(storage$solution))
    beta <- runif(1)
  else
    beta <- storage$solution

  sol.inner <-nloptr(
    x0     = beta,
    eval_f = inner,
    opts   = list(
      algorithm = "NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6, 
      xtol_abs = 0,
      maxeval = 1000000
    )
  ) 

  storage <-  c()
  storage <<- append(storage,sol.inner)
  beta    <-  sol.inner$solution

  mean(x^2 - alpha* x + beta)^2

}

alpha0    <- runif(1)
storage   <- c()

sol.outer <- nloptr(
  x0     = alpha0,
  eval_f = outer,
  opts   = list(
    algorithm ="NLOPT_LN_BOBYQA",
    ftol_rel  = 1.e-6,
    ftol_abs  = 1.e-7,
    xtol_rel  = 1.e-6,
    xtol_abs  = 0,
    maxeval   = 1000000
  )
) 

sol.outer

Solution

While very neat, I don't recommend using the <<- operator in general. If you want to modify elements within functions so that you can use them once the function has exited, I recommend you using environments instead.

The thing with parallel processing is that, as implemented in the parallel package, each thread/offspring/child is running in its own session, meaning that those do not interact with each other. In that scenario, you can do pretty much what you want within each offspring process. Here is an example of what you are trying to do:


# Simulating 4 random datasets
set.seed(131)
datasets <- replicate(4, {
  list(
    y = rnorm(100),
    x = runif(100)*5
  )
}, simplify = FALSE)


inner <- function(beta, x, y) mean((y-beta*x)^2)

outer <- function(alpha, storage, x, y) {

  if (!length(storage$solution))
    beta <- runif(1)
  else
    # Take the first value, which is the latest to be
    # stored (see below)
    beta <- storage$solution[[1]]

  sol.inner <- nloptr(
    x0     = beta,
    eval_f = inner,
    opts   = list(
      algorithm = "NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6, 
      xtol_abs = 0,
      maxeval = 1000000
    ),
    y = y,
    x = x
  ) 

  # We can append the latest beta as a list
  storage$solution <- c(list(sol.inner$solution), storage$solution)
  beta    <-  sol.inner$solution

  mean(x^2 - alpha* x + beta)^2

}

# Parallel solution with PSOCKcluster --------------------
library(parallel)

# Setting up the cluster object
cl <- makePSOCKcluster(4)

# We need to export the objects we plan to use within
# each session this includes loading the needed packages
clusterExport(cl, c("outer", "inner"))
invisible(clusterEvalQ(cl, library(nloptr)))
invisible({
  clusterEvalQ(cl, {
    # Be careful about random numbers in parallel!
    # This example is not reproducible right now
    alpha0    <- runif(1)

    # This should be an environment, which is easier to handle
    storage   <- new.env() 
  })
})



# You can send data to the offspring sessions and 
# these will be evaluated in separate R sessions 
ans <- parLapply(cl, datasets, function(d) {

  # Making the variables available to the program
  y <- d$y
  x <- d$x

  sol.outer <- nloptr(
    x0     = alpha0,
    eval_f = outer,
    opts   = list(
      algorithm ="NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6,
      xtol_abs  = 0,
      maxeval   = 1000000
    ),
    x = d$x,
    y = d$y,
    # Passing the environment as an extra
    #  argument to the function
    storage = storage
  ) 

  list(
    sol     = sol.outer,
    storage = storage
  )

})

# Stopping the R sessions
stopCluster(cl)

# Checking out the storage vectors
lapply(ans, function(x) unlist(x$storage$solution))
#> [[1]]
#>  [1] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#>  [6] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#> [11] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#> [16] -0.04112901
#> 
#> [[2]]
#>  [1] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#>  [6] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#> [11] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#> [16] -0.06877397 -0.06877397
#> 
#> [[3]]
#>  [1] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#>  [6] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#> [11] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#> [16] 0.004505708 0.004505708
#> 
#> [[4]]
#>  [1] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#>  [6] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#> [11] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#> [16] -0.02001445

^{Created on 2019-11-20 by the reprex package (v0.3.0)}

One thing to notice here is that I modified your functions so that arguments are passed explicitly, hence we will not be dealing with scoping in this case. This is usually safer and more recent versions of R are smart enough to avoid copying objects when passed to functions.

One last bit to point out is that, if your datasets are large, it is better to actually load them within the offspring sessions to avoid duplicating memory (In general this last point is not a problem if you use makeForkCluster, but this is only available for unix based systems).