
Saving R objects to global environment from inside a nested function called by a parent function using mcmapply


I am trying to write an R script that uses nested functions to save multiple data.frames in parallel to the global environment. The sample code below works fine on Windows. But when I moved the same code to a Linux server, the objects that prepare_output() saves to the global environment are not picked up by the save() call in get_output().

Am I missing something fundamentally different about how mcmapply affects scoping on Linux vs. Windows?

library(data.table)
library(parallel)

#Function definitions
default_case <- function(flag){
  if(flag == 1){
    create_input()
    get_output()
  }else{
    print("select a proper flag!")
  }
}

create_input <- function(){
  dt_initial <<- data.table('col1' = c(1:20), 'col2' = c(21:40)) #Assignment to global envir
}


get_output<- function(){

  list1 <- c(5,6,7,8)
  dt1 <- data.table(dt_initial[1:15,])

  prepare_output<- function(cnt){
    dt_new <- data.table(dt1)
    dt_new <- dt_new[col1 <= cnt,  ]
    assign(paste0('dt_final_',cnt), dt_new, envir =  .GlobalEnv )
    #eval(call("<<-",paste0('dt_final_',cnt), dt_new))

    print('contents in global envir inside:')
    print(ls(name = .GlobalEnv)) # This prints all object names dt_final_5 through dt_final_8 correctly (inside the worker)
  }

  mcmapply(FUN = prepare_output,list1,mc.cores = globalenv()$numCores)


  print('contents in global envir outside:')
  print(ls(name = .GlobalEnv)) #this does NOT print dataframes generated and assigned to global in function prepare_output

  save( list = ls(name = .GlobalEnv)[ls(name = .GlobalEnv) %like% 'dt_final_' ], file = 'dt_final.Rdata')
}

if(Sys.info()['sysname'] == "Windows"){numCores <- 1}else{numCores <- parallel::detectCores()}
print('numCores:')
print(numCores)

#Function call
default_case(1)

The reason I am using a nested structure is that preparing dt1 is time-consuming, and I do not want to repeat that work on every iteration of the apply call.


Solution

  • (Sorry, I'll write this as an 'Answer' because the comment box is too brief)

    The best solution to your problem is to make sure you return the objects you produce, rather than trying to assign them from inside a function to an external environment. [edit 2020-01-26] That never works in parallel processing, because parallel workers do not have access to the environments of the main R process.

    A very good rule of thumb in R that will help you achieve this: never use assign() or <<- in code - neither for sequential nor for parallel processing. At best you can get such code to work in sequential mode, but, in general, you will end up with hard-to-maintain and error-prone code.

    By focusing on returning values (e.g. results <- mcmapply(...) in your example), you'll get it right. It also fits much better with the overall functional design of R and parallelizes more naturally.

    I've got a blog post 'Parallelize a For-Loop by Rewriting it as an Lapply Call' from 2019-01-11 that might help you transition to this functional style.
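The environment isolation can be demonstrated directly. In this sketch (which only behaves as described on a system where mclapply() actually forks, i.e. not on Windows), each worker's assignment lands in the child process's copy of the global environment and vanishes when the worker exits:

```r
library(parallel)

x <- "unchanged"

# Each forked worker gets a copy-on-write snapshot of the parent process.
# The assign() below modifies the child's copy of the global environment only.
worker_views <- mclapply(1:2, function(i) {
  assign("x", paste("set by worker", i), envir = .GlobalEnv)
  get("x", envir = .GlobalEnv)  # the worker itself sees its own change
}, mc.cores = 2)

print(unlist(worker_views))  # each worker saw its own modified value
print(x)                     # still "unchanged" in the parent
```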
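Applied to the question's code, get_output() could collect the data.tables from mcmapply's return value instead of assigning them from the workers. This is a sketch, not the blog post's exact code; SIMPLIFY = FALSE keeps the result as a list, and list2env() gives save() a place to look the names up without touching the global environment:

```r
library(data.table)
library(parallel)

# Same setup as in the question
dt_initial <- data.table(col1 = 1:20, col2 = 21:40)

get_output <- function() {
  list1 <- c(5, 6, 7, 8)
  dt1 <- data.table(dt_initial[1:15, ])  # prepared once, reused by every worker

  prepare_output <- function(cnt) {
    dt_new <- data.table(dt1)
    dt_new[col1 <= cnt, ]  # returned to the parent, not assigned anywhere
  }

  # mc.cores > 1 is only supported on systems that can fork
  n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
  results <- mcmapply(FUN = prepare_output, list1,
                      SIMPLIFY = FALSE, mc.cores = n_cores)
  names(results) <- paste0("dt_final_", list1)

  # save() the named objects from a throwaway environment built off the list
  save(list = names(results), envir = list2env(results), file = "dt_final.Rdata")
  results
}

res <- get_output()
print(names(res))        # dt_final_5 ... dt_final_8
print(nrow(res$dt_final_5))
```

Because everything flows through return values, the same code behaves identically whether the workers are forked (Linux) or the call degrades to sequential execution (Windows).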