
How to call a parallelized script from command prompt?


I'm running into an issue that, for the life of me, I can't figure out how to solve.

Quick summary before example:

I have several hundred data sets for which I want to create reports every day. To do this efficiently, I parallelized the process with doParallel. From within RStudio the process works fine, but when I try to automate it via Task Scheduler on Windows, I can't get it to work.

The process within RStudio is:

I call a script that sources all of my other scripts. Each individual script has a header section that performs the appropriate package imports. The sourcing script looks something like:

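get_files <- function(){
  path <- get_files.create_path()
  # Loop over the entries in the directory, not over the path string itself
  # (assumed intent). `path` is assumed to end with a path separator,
  # because it is pasted directly onto each file name.
  for(file in list.files(path)){
    # Source regular files only; skip subdirectories
    if(!(file.info(paste0(path, file))[['isdir']])){
      source(paste0(path, file))
    }
  }
}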

get_files.create_path <- function(){
   return(<path to directory>)
}

#self call
get_files()

This script is simply set to "Source on Save", so saving it brings everything I need into the .GlobalEnv.

From there, I can simply type parallel_report(), which sources another script that houses the parallelized report generation. A while back there was an issue with calling the parallelization directly (I wonder if this is related?), so I had to keep the doParallel code as a plain script rather than wrapping it in a function. That meant it couldn't be brought in by the get_files script, because sourcing it would start the report generation every time I loaded everything. So I put it in its own script and saved it elsewhere, to be called when necessary. The parallel_report() function is simply:

parallel_report <- function(){
  source(<path to script>)
}

Then the script that is sourced is the real parallelization script, and would look something like:

doParallel::registerDoParallel(cl = (parallel::detectCores() - 1))
foreach(name = report.list$names,
        .packages = c('tidyverse', 'knitr', 'lubridate', 'stringr', 'rmarkdown'),
        .export = c('generate_report'),
        .errorhandling = 'remove') %dopar% {
  tryCatch(expr = {
    generate_report(name)
  }, error = function(e){
    error_handler(error = e, caller = paste0("generate report for ", name,  " from parallel"), line = 28)
  })
}
doParallel::stopImplicitCluster()

The generate_report function is simply a wrapper that calls render() on an .Rmd file:

generate_report <- function(<arguments>){
 #stuff
 generate_report.render(<arguments>)
 #stuff 
}  

generate_report.render <- function(<arguments>){
   rmarkdown::render(
    paste0(data.information@location, 'report_generator.Rmd'),
    params = list(
      name = name,
      date = date,
      thoughts = thoughts, 
      auto = auto),
    output_file = paste0(str_to_upper(stock), '_report_', str_remove_all(date, '-'))
  )
}

So to recap, in RStudio I would simply perform the following:

1 - Source-on-save the script to bring everything in

2 - Type parallel_report()

2.a - This directly calls the doParallel parallelization of generate_report

2.b - generate_report calls an .Rmd file that houses the required function calls and whatnot to produce the reports

And the process starts and successfully completes without a hitch.

To automate this via the Task Scheduler, I made a script that the Task Scheduler can call, named automatic_caller:

source(<path to the get_files script>) # this brings in all the scripts and data into the global, just 
# as if it were being done manually
tryCatch(
  expr = {
    parallel_report()
  }, error = function(e){
     error_handler(error = e, caller = "parallel_report from automatic_callng", line = 39)
})

The error_handler function is just an in-house script used to log errors throughout.

In the Task Scheduler task, I have Rscript.exe called with the automatic_caller script after it. Everything within the automatic_caller script works except for the report generation. The process completes almost immediately, and the only output I get is an error:

"pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available)."

But rmarkdown is listed in the .packages argument of the foreach call, it is loaded in the scripts that use it explicitly, and in the actual generate_report it is called directly via rmarkdown::render().
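For anyone debugging the same symptom, one quick check is to log what the session knows about pandoc at the top of automatic_caller and compare the output between RStudio and Rscript.exe. This is a minimal sketch in base R only; the helper name pandoc_diagnostics is hypothetical and not part of the scripts above.

```r
# Hypothetical helper (an assumption, not from the original scripts):
# collect what the current R session knows about pandoc.
pandoc_diagnostics <- function() {
  list(
    # FALSE when the script is run through Rscript.exe
    interactive = interactive(),
    # Set by RStudio inside its own sessions; usually empty elsewhere
    rstudio_pandoc = Sys.getenv("RSTUDIO_PANDOC"),
    # Full path to pandoc if it is on the PATH, otherwise ""
    pandoc_on_path = unname(Sys.which("pandoc"))
  )
}

# Print the result so it ends up in the scheduled task's output
str(pandoc_diagnostics())
```

If rstudio_pandoc is filled in under RStudio but both pandoc entries are empty under Rscript.exe, the scheduled session has no way to locate pandoc.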

So - I am at a complete loss.

Thoughts and suggestions would be greatly appreciated.


Solution

  • So pandoc is apparently an executable that converts files from one format to another. RStudio ships with its own pandoc executable, so when the scripts run from RStudio, rmarkdown knows where to find pandoc when it is required.

    From the command prompt, the system did not know to look inside RStudio's installation, so installing pandoc as a standalone executable gives the system a pandoc it can find.

    I downloaded pandoc and everything works fine.
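An alternative that avoids a separate install is to point rmarkdown at the pandoc that RStudio already ships, through the RSTUDIO_PANDOC environment variable. The following is a minimal sketch, not a tested part of the setup above: ensure_pandoc is a hypothetical helper, and the directory to pass is whatever Sys.getenv("RSTUDIO_PANDOC") prints inside RStudio on your machine.

```r
# Hypothetical helper (an assumption, not from the original scripts).
# Call it near the top of automatic_caller, before any report is rendered.
ensure_pandoc <- function(pandoc_dir = Sys.getenv("RSTUDIO_PANDOC")) {
  # Only set the variable when the directory really exists
  if (nzchar(pandoc_dir) && dir.exists(pandoc_dir)) {
    Sys.setenv(RSTUDIO_PANDOC = pandoc_dir)
  }
  # Without rmarkdown installed there is nothing to check
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    return(FALSE)
  }
  # TRUE when rmarkdown can find pandoc >= 1.12.3, the version named in the error
  rmarkdown::pandoc_available("1.12.3")
}

if (!ensure_pandoc()) {
  message("pandoc not found; install it or pass its directory to ensure_pandoc()")
}
```

Calling this before doParallel::registerDoParallel() is the safer order, since worker processes would typically inherit environment variables from the session that starts them.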