Tags: apache-spark, sparklyr

Do I need a local version of Spark when connecting to another Spark cluster through sparklyr?


I have a production R cluster with RStudio installed. Users are load-balanced onto an R server and write code there. I also have a separate Spark cluster with 4 nodes. Using sparklyr, I can easily connect to my Spark cluster via:

sc <- sparklyr::spark_connect("spark://<my cluster>:7077")

The only thing I notice is that there is some Spark application usage on the R production server when I do this, and I believe it is causing some issues. I have Spark installed on both the R production servers and the Spark cluster, at the same SPARK_HOME location, /var/lib/Spark.
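
For reference, this is roughly how that connection picks up the local install: spark_home defaults to the SPARK_HOME environment variable, so the explicit argument below just makes the default visible (the paths and cluster URL are from my setup):

library(sparklyr)

# The driver JVM is launched from this local installation on the R server,
# which is where the local Spark activity comes from.
Sys.getenv("SPARK_HOME")          # "/var/lib/Spark"
sc <- spark_connect(
  master     = "spark://<my cluster>:7077",
  spark_home = "/var/lib/Spark"
)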

I would like to avoid having Spark on my R servers entirely, so that there is no Spark-related usage on them. How do I do that with sparklyr?


Solution

  • Yes, you do need a local Spark installation to submit Spark applications. Beyond that, it depends on the deploy mode:

    • In client mode, the driver runs on the same node from which you submit the application.
    • In cluster mode, the driver runs on the cluster and there is no local Spark process. However, cluster mode doesn't support interactive processing (see the sketch below).
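
To make the two modes concrete, here is a rough sketch: an interactive sparklyr session behaves like the client-mode case, while cluster mode corresponds to a batch spark-submit. The application jar and class names below are placeholders, not anything from the question.

# Client mode: the driver runs on the machine that submits the application
# (the R server), so a local Spark installation is needed there.
system2("/var/lib/Spark/bin/spark-submit",
        args = c("--master", "spark://<my cluster>:7077",
                 "--deploy-mode", "client",
                 "--class", "com.example.MyBatchJob",  # placeholder class
                 "my-batch-job.jar"))                  # placeholder jar

# Cluster mode: the driver is launched on one of the cluster's workers, so
# no long-running Spark process stays on the R server -- but this is a
# batch submission, not an interactive sparklyr session.
system2("/var/lib/Spark/bin/spark-submit",
        args = c("--master", "spark://<my cluster>:7077",
                 "--deploy-mode", "cluster",
                 "--class", "com.example.MyBatchJob",
                 "my-batch-job.jar"))

Note that even the cluster-mode submission above still calls the spark-submit script from the local /var/lib/Spark installation, which is why some local Spark installation is unavoidable for submitting applications.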