r, apache-spark, rstudio, sparkr, rscript

Running SparkR from RStudio returns "Cannot run program Rscript"


I am trying out SparkR with RStudio, but it doesn't seem to work. I have tried the suggested solutions on other questions, but I still can't figure out why it isn't running.

The code I am running is as follows:

if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "c://spark")
}
library(SparkR)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sc <- sparkR.session(
  master = "spark://192.168.56.1:7077",
  appName = "R Spark",
  sparkConfig = list(spark.cassandra.connection.host = "localhost"),
  sparkPackages = "datastax:spark-cassandra-connector:1.6.0-s_2.11"
)
df<- as.DataFrame(faithful)
showDF(df)

The message I get is:

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.56.1): java.io.IOException: Cannot run program "Rscript":
CreateProcess error=2, Das System kann die angegebene Datei nicht finden (the system cannot find the specified file)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:348)
    at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:386)
    at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
    at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:50)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.

I am running it on a standalone cluster with one worker.

Spark: 2.0.2

RStudio: 1.0.136

R: 3.3.2


Solution

  • I was having a similar problem under RStudio with a 2-node cluster.

    The issue is that while the machine running your R driver has R installed, your worker node doesn't (or at least doesn't have Rscript on its execution path). As a result, when Spark tries to run a bit of R code on the worker instead of on the master, it cannot find Rscript.

    Solution: install R on your worker node and make sure Rscript is on its PATH; an alternative based on Spark configuration is sketched at the end of this answer.

    I hope this helps!
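
    If R is in fact installed on the worker but Rscript is not on the PATH of the account that runs the Spark worker process, another option is to tell Spark which executable to launch via the spark.r.command property. A minimal sketch, assuming a Windows worker with R installed under C:/Program Files/R/R-3.3.2 (an assumed path; adjust it to your own installation):

    # Run this in R on the worker machine: an empty string means Rscript is not on its PATH
    Sys.which("Rscript")

    # Point Spark at an explicit Rscript executable for the R workers.
    # The install path below is an assumption; replace it with yours.
    sc <- sparkR.session(
      master = "spark://192.168.56.1:7077",
      appName = "R Spark",
      sparkConfig = list(
        spark.cassandra.connection.host = "localhost",
        spark.r.command = "C:/Program Files/R/R-3.3.2/bin/Rscript.exe"
      ),
      sparkPackages = "datastax:spark-cassandra-connector:1.6.0-s_2.11"
    )

    Alternatively, adding R's bin directory to the system PATH on the worker machine and restarting the worker has the same effect.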