Tags: r, apache-spark, sparklyr

spark_apply Cannot run program “Rscript”: in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles


Following the first instructions about spark_apply from the book "Mastering Apache Spark with R", on a local cluster under Windows and using RGui, running:

install.packages("sparklyr")
install.packages("pkgconfig")
spark_install("2.3")
Installing Spark 2.3.3 for Hadoop 2.7 or later.
spark_installed_versions()
library(dplyr,sparklyr)
sc <- spark_connect(master = "local", version = "2.3.3")
cars <- copy_to(sc, mtcars)    
cars %>% spark_apply(~round(.x))

returns the following error:

spark_apply Cannot run program “Rscript”:  in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles-..  
CreateProcess error=2, The file specified can't be found

How can I correctly install sparklyr, and how do I get rid of this error?


Solution

  • Each Spark node needs the Rscript executable on its PATH. For the master node, it is possible to set the path to the Rscript executable explicitly using the following commands:

    config <- spark_config()
    # point spark.r.command at the Rscript executable of your R installation
    config[["spark.r.command"]] <- "d:/path/to/R-3.4.2/bin/Rscript.exe"
    sc <- spark_connect(master = "local", config = config)
    

    More explanations and guidelines for distributed environments can be found in the sparklyr documentation on distributing R computations; a short verification sketch follows below.
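
    As a minimal sketch (assuming a Windows setup where Rscript.exe sits in the bin folder of your R installation; the path below is a placeholder, not the asker's actual path), the fix can be verified by reconnecting with the config, re-running the example from the question, and asking each worker which Rscript it resolves:

    library(sparklyr)
    library(dplyr)

    config <- spark_config()
    # placeholder path: point this at the Rscript.exe of your own R installation
    config[["spark.r.command"]] <- "C:/path/to/R/bin/Rscript.exe"

    sc <- spark_connect(master = "local", version = "2.3.3", config = config)
    cars <- copy_to(sc, mtcars, overwrite = TRUE)

    # the original example should now run without the CreateProcess error
    cars %>% spark_apply(~round(.x))

    # optional check: report which Rscript each worker finds on its own PATH
    cars %>% spark_apply(function(df) {
      data.frame(rscript = unname(Sys.which("Rscript")), stringsAsFactors = FALSE)
    })

    spark_disconnect(sc)

    If the optional check returns an empty string on a worker, that worker still cannot see Rscript and either its PATH or spark.r.command needs to be adjusted there as well.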