
Connect sparklyr to remote spark connection


I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you add a new connection in the sparklyr UI tab in RStudio and choose "Cluster", it says that you have to be running on the cluster or have a high-bandwidth connection to the cluster.

Can anyone shed light on how to create that kind of connection? I am not sure how to create a reproducible example of this, but in general what I would like to do is:

library(sparklyr)
sc <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP]:7077", spark_home = "/home/ubuntu/spark-2.0.0", version = "2.0.0")

from a remote server. I understand that there will be latency, especially when passing data between the two machines. I also understand that it would be better to run RStudio Server on the cluster itself, but that is not always possible, and I am looking for a sparklyr option for interacting between my server and my desktop RStudio session. Thanks.


Solution

  • As of sparklyr version 0.4, connecting from RStudio desktop to a remote Spark cluster is unsupported. Instead, as you mention, the recommended approach is to install RStudio Server within the Spark cluster.

    That said, the livy branch of sparklyr is exploring integration with Livy, which would enable RStudio desktop to connect to a remote Spark cluster through Livy.
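
For illustration, here is a minimal sketch of what a Livy-based connection from a desktop session might look like. It assumes a sparklyr release newer than 0.4 that includes Livy support (exposed as method = "livy" in spark_connect) and a Livy server already running in front of the cluster. The host name livy.example.com is hypothetical, 8998 is Livy's default port, and the version value is taken from the question; adjust all three to your setup.

```r
library(sparklyr)

# Hypothetical Livy endpoint (assumption): replace with the address of
# the Livy server that fronts your Spark cluster. 8998 is Livy's default port.
livy_url <- "http://livy.example.com:8998"

# Connect through Livy's REST interface instead of a local spark-submit,
# so no local Spark installation (spark_home) is needed on the desktop.
sc <- spark_connect(
  master  = livy_url,
  method  = "livy",
  version = "2.0.0"  # should match the Spark version on the cluster
)

# Sanity check: ask the remote session which Spark version it is running.
spark_version(sc)

# Close the Livy session when finished.
spark_disconnect(sc)
```

Because every call goes over HTTP to the Livy server, expect noticeably higher latency than with RStudio Server running on the cluster, especially when copying data between the desktop and Spark.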