
How to run another Rscript after several R jobs running in parallel are done?


I first need to run 4 R scripts in parallel, which I do with the rstudioapi::jobRunScript() function. None of the scripts running in parallel imports anything from any environment; instead, each one exports the data frames it creates to the global environment. My 5th R script builds on the data frames created by those 4 scripts, and it currently runs in the console. It would be a lot better if the 5th script could run in the background, rather than in the console, once the first 4 R scripts have finished running in parallel. I'm also trying to reduce the total running time of the whole process.

Although I was able to figure out how to run the first 4 R scripts in parallel, my task isn't completely done because I can't find a way to trigger my 5th R script once they finish. Hope y'all can help me here.


Solution

  • This is a bit too open for my liking. While rstudioapi can definitely be used for running parallel tasks, it is not very versatile and does not give you very useful output. Parallel computing is well supported in R, with several packages that provide a much simpler and better interface for doing this. Here are 3 options, which also allow something to be 'output' from the different files.

    package = parallel

    With the parallel package we can achieve this very simply: create a vector of the files to be sourced and execute source on each of them in a separate worker process. The main process will block while they are running, but if you have to wait for them to finish anyway, this doesn't really matter much.

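    library(parallel)
    ncpu <- detectCores()
    cl <- makeCluster(ncpu)
    # full path to file that should execute
    files <- c(...)
    # use an lapply in parallel.
    # Each file is sourced in a worker process, so objects it creates
    # live in that worker's global environment, not in the main session.
    result <- parLapply(cl, files, source)
    # Remember to close the cluster
    stopCluster(cl)
    # If anything is returned this can now be used:
    # result[[i]]$value is the value of the last expression in files[i].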
    

    As a side note, several packages have a similar interface to the parallel package, which was built upon the snow package, so it is a good baseline to have knowledge of.
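
    Since the question is about data frames that a 5th script needs, here is a minimal sketch of one way to get them back into the main session with parallel. The four temporary scripts, the names df1 to df4, the helper run_script and the 5th script are all hypothetical stand-ins made up for illustration. The assumption is that each script only creates objects and that those objects can be serialized back from the workers.

```r
library(parallel)

# Hypothetical stand-ins for the four scripts: each creates one data frame.
script_dir <- tempfile("scripts")
dir.create(script_dir)
files <- file.path(script_dir, sprintf("script%d.R", 1:4))
for (i in seq_along(files)) {
  writeLines(
    sprintf("df%d <- data.frame(id = 1:3, value = (1:3) * %d)", i, i),
    files[i]
  )
}

# Source a script into a fresh environment on the worker and
# return everything the script created as a named list.
run_script <- function(file) {
  env <- new.env(parent = globalenv())
  source(file, local = env)
  as.list(env)
}

cl <- makeCluster(2)
created <- parLapply(cl, files, run_script)  # blocks until all four are done
stopCluster(cl)

# Copy the returned objects into the main session's global environment.
for (objs in created) list2env(objs, envir = globalenv())

# Hypothetical 5th script that builds on the four data frames.
fifth <- file.path(script_dir, "script5.R")
writeLines("combined <- rbind(df1, df2, df3, df4)", fifth)
source(fifth)
```

    Because parLapply only returns once every worker has finished, the 5th script can simply be sourced on the next line. Returning the objects explicitly is needed because each worker has its own global environment.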

    package = foreach

    An alternative to the parallel package is the foreach package, which gives something similar to a for-loop interface, simplifying the interface while giving more flexibility and automatically exporting necessary libraries and variables (although it is safer to do this manually).
    To run in parallel, however, foreach needs a registered backend. Here that is doParallel, which uses the parallel package to set up the cluster.

    library(parallel)
    library(doParallel)
    library(foreach)
    ncpu <- detectCores()
    cl <- makeCluster(ncpu)
    files <- c(...) 
    registerDoParallel(cl)
    # Run parallel using foreach
    # remember %dopar% for parallel. %do% for sequential.
    result <- foreach(file = files, .combine = list, .multicombine = TRUE) %dopar% { 
      source(file)
      # Add any code before or after source.
    }
    # Stop cluster
    stopCluster(cl)
    # Do more stuff. Result holds any result returned by foreach.
    

    While it does add a few lines of code, the .combine, .packages and .export arguments make for a very simple interface to work with parallel computing in R.
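
    To make two of those arguments concrete, here is a small sketch that uses .combine = rbind to stack the per-iteration results into one data frame and .packages to attach a package on every worker. The file names are hypothetical and are never opened. The tools package is used only because it ships with R.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical script paths, used only as strings here.
files <- c("a/first.R", "b/second.R", "c/third.R")

# .combine = rbind stacks the data frames returned by each iteration;
# .packages = "tools" attaches tools on each worker, so file_ext() and
# file_path_sans_ext() can be called without the tools:: prefix.
summary_df <- foreach(f = files, .combine = rbind, .packages = "tools") %dopar% {
  data.frame(
    script = file_path_sans_ext(basename(f)),
    extension = file_ext(f)
  )
}

stopCluster(cl)
```

    The .export argument works the same way for variables: it takes a character vector of names that should be copied to the workers when the automatic detection misses them.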

    package = future

    Now this is one of the less commonly used packages. future provides a parallel interface that is more flexible than both parallel and foreach, allowing for asynchronous parallel programming. The implementation can however seem a bit more daunting, and the example I provide below only scratches the surface of what is possible.
    Also worth mentioning is that while the future package does provide automatic export of the functions and packages necessary to run code, experience has made me aware that this is limited to the first level of depth in any call (sometimes less), so exporting manually is still necessary.
    While foreach depends on parallel (or similar) to start a cluster, future will start one itself using all the available cores. A simple call to plan(multisession) will start a multi-process session. (Older versions of future used plan(multiprocess) for this, which has since been deprecated in favour of multisession.)

    library(future)
    files <- c(...)
    # Start a multi-process session
    # (plan(multiprocess) in older versions of future, now deprecated)
    plan(multisession)
    # Simple wrapper function, so we can iterate over the files variable easier.
    # The future has to call source(); future(file) would only return the path.
    source_future <- function(file)
      future(source(file))
    results <- lapply(files, source_future)
    # Do some calculations in the meantime
    print('hello world, I am running while waiting for the futures to finish')
    # Force waiting for the futures to finish
    resolve(results)
    # Extract any result from the futures
    # (value() also accepts a list of futures; values() is deprecated)
    results <- value(results)
    # Clean up the process (close down clusters)
    plan(sequential)
    # Run some more code.
    

    Now this might seem quite heavy at first, but the general mechanism is:

    1. Call plan(multisession) (plan(multiprocess) in older versions of future)
    2. Execute some function(s) using future (or %<-%, which I won't go into)
    3. Do something else if you have more code to run, that does not depend on the processes
    4. Wait for the results using resolve, which works on a single future or multiple futures in a list (or environment)
    5. Collect the result using value, which works on a single future or on multiple futures in a list (or environment); older versions of future used values for the latter
    6. Clear up any cluster running in the future environment by using plan(sequential)
    7. Continue with code that depended on the result of your futures.
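
    The steps above can be sketched as a small self-contained run. This is only an illustration under stated assumptions: make_df is a hypothetical stand-in for one of the scripts, and the final future stands in for the 5th script running in the background.

```r
library(future)

# 1. Start background R sessions (2 workers is an arbitrary choice here).
plan(multisession, workers = 2)

# Hypothetical stand-in for the work done by each of the four scripts.
make_df <- function(i) data.frame(id = 1:3, value = (1:3) * i)

# 2. Launch the four tasks as futures.
futures <- lapply(1:4, function(i) future(make_df(i)))

# 3. The main session stays free for unrelated work in the meantime.
message("main session is not blocked while the futures run")

# 4. Wait until every future has finished.
resolve(futures)

# 5. Collect the four data frames as a list.
dfs <- value(futures)

# Run the dependent step in the background as well, then fetch its result.
fifth <- future(do.call(rbind, dfs))
combined <- value(fifth)

# 6. Shut down the background sessions.
plan(sequential)

# 7. Continue with code that depends on the results.
print(nrow(combined))
```

    Launching the dependent step as its own future only after value(futures) has returned is what guarantees it starts once the first four tasks are done.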

    I believe these 3 packages provide interfaces to every necessary element of multiprocessing (at least on CPU) that any user needs to interface with. Other packages provide alternative interfaces, while for asynchronous programming I am only aware of future and promises. In general I'd advise most users to be very careful when moving into asynchronous programming, as this can cause a whole suite of problems that are less frequent in synchronous parallel programming.

    I hope this may help provide an alternative to the (very limiting) rstudioapi interface, which I am fairly certain was never meant to be used for parallel programming by the users themselves, but more likely intended to perform tasks such as building a package in parallel by the interface itself.