
How to store data in a Spark cluster using sparklyr?


If I connect to a Spark cluster, copy some data to it, and disconnect, ...

library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

then the next time I connect to Spark, the data is not there.

sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)

This is different from working with a database, where the data is simply there regardless of how many times you connect.

How do I persist data in the Spark cluster between connections?

I thought sdf_persist() might be what I want, but it appears not to be.


Solution

  • Spark is an engine that runs on a computer or cluster to execute tasks; it is not a database or a file system. Data held in a Spark session is gone once you disconnect. When you are done, save the data out to a file system (for example as Parquet files) and load it back in during your next session; a sketch is shown after the link below.

    https://en.wikipedia.org/wiki/Apache_Spark
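As a minimal sketch of that idea, you could write the table to Parquet with spark_write_parquet() before disconnecting and read it back with spark_read_parquet() in the next session. The /tmp/iris_parquet path and the iris data are just placeholders for your own data and storage location.

    library(sparklyr)
    library(dplyr)

    # First session: copy the data into Spark, then write it out to disk
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris)
    spark_write_parquet(iris_tbl, path = "/tmp/iris_parquet")
    spark_disconnect(sc)

    # Next session: read the Parquet files back into the new Spark session
    sc <- spark_connect(master = "local")
    iris_tbl <- spark_read_parquet(sc, name = "iris", path = "/tmp/iris_parquet")
    src_tbls(sc)
    ## [1] "iris"
    spark_disconnect(sc)

In a local setup the Parquet files live on your machine's disk; on a real cluster you would point the path at shared storage (e.g. HDFS or S3) so every session can reach it.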