Run R script with Rscript vs spark-submit

I don't understand the difference between running an R file using Rscript vs spark-submit.

In the file I pass the options to connect to the cluster, so I don't know what the advantage of using spark-submit is.

sparkR.session(master = "spark://...", appName = "test", sparkConfig = list(spark.driver.memory = "1g", spark.driver.cores = 1L, spark.executor.memory = "2g", spark.cores.max = 2L))

After creating the Spark session, the R program queries a Parquet file stored in HDFS using SQL.

I tried both ways of running my program, and as far as I can tell they do exactly the same thing.

Thanks in advance


    • Running a SparkR program with Rscript just evaluates it as a plain R program. That is fine for simple cases, but it is limited.
    • Using spark-submit allows you to set many Spark-specific options at launch, including, but not limited to, the master URI, deploy mode, memory, cores, configuration options, jars, and packages.

      Most of these can also be set through Spark configuration or hard-coded in the script, but spark-submit offers more flexibility.

    The same applies to other supported languages (Java, Python, Scala).
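
To make the comparison concrete, here is a rough sketch of the two launch styles for the settings in the question. It is a dry run: it only builds and prints the commands, so it works without a Spark installation. The script name `query.R` is hypothetical, and the master URL is left elided exactly as in the question. The flags used (`--master`, `--name`, `--driver-memory`, `--executor-memory`, `--conf`) are standard spark-submit options, but check `spark-submit --help` for your version.

```shell
#!/bin/sh
# Dry-run sketch: prints the launch commands instead of executing them.
# Assumptions (not from the original post): the file name query.R;
# the master URL is a placeholder, elided as in the question.

MASTER_URL="spark://..."   # placeholder, substitute your cluster's master URL
SCRIPT="query.R"           # hypothetical script name

# 1) Plain R: every Spark setting has to come from sparkR.session(...)
#    inside the script (or from Spark's configuration files).
rscript_cmd="Rscript $SCRIPT"

# 2) spark-submit: the same settings are supplied at launch time, so the
#    script itself no longer needs to hard-code them.
submit_cmd="spark-submit --master $MASTER_URL --name test --driver-memory 1g --executor-memory 2g --conf spark.cores.max=2 $SCRIPT"

echo "$rscript_cmd"
echo "$submit_cmd"
```

With the second form, the same unchanged script can be pointed at a different cluster or given different memory settings just by changing the command line. Note that, as far as I know, values set in application code generally take precedence over spark-submit flags, so the script would need to drop the hard-coded arguments for the flags to have an effect.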