Tags: r, apache-spark, hadoop, dataframe, sparklyr

Are Spark data frames automatically deleted after disconnecting in sparklyr? If not, how do we delete them?


What happens, when the connection is closed, to data frames that were copied to Spark in the following way?

library(sparklyr)
library(dplyr)
# connect to a local Spark instance
sc <- spark_connect(master = "local")
# copies iris into Spark and registers it as the temporary view "iris"
iris_tbl <- copy_to(sc, iris)
spark_disconnect(sc)

If they aren't deleted automatically, is there an easy way to delete all the data frames created during a session, other than dropping each one individually like this?

sc %>% spark_session() %>% invoke("catalog") %>% invoke("dropTempView", "iris")
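
The best I have come up with is to loop over whatever src_tbls() reports for the connection and drop each view through the same catalog call; this is just a sketch, and I'm not sure it covers everything (cached data, for example):

drop_all_temp_views <- function(sc) {
  # src_tbls() lists the tables registered in this connection;
  # drop each one via the session catalog
  for (view in src_tbls(sc)) {
    sc %>%
      spark_session() %>%
      invoke("catalog") %>%
      invoke("dropTempView", view)
  }
}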

Even if cleanup happens automatically, does it happen immediately on disconnect, or lazily, whenever Spark decides it needs to clean up temporary views?

I have a script which repeatedly connects to Spark and copies temporary data frames into it for some manipulation. I'm concerned about those temporary data frames piling up on the cluster if they are never deleted.


Solution

  • In general, the life cycle of temporary views in Spark is tightly coupled to the life cycle of the corresponding SparkSession, and they cannot be accessed outside its scope (global temporary views are an exception, but, like standard views, they cannot outlive their session). If the JVM session is closed and / or garbage collected, the corresponding temporary space is reclaimed.

    However, temporary views are not removed otherwise, so as long as the session lives, so do the temporary tables.

    As I explained elsewhere (How to delete a Spark DataFrame using sparklyr?), this is usually not a serious concern.
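
    To see the coupling in practice, here is a quick sketch (local mode; disconnecting shuts the session down, and a new connection starts with an empty catalog):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    copy_to(sc, iris)     # registers the temporary view "iris"
    src_tbls(sc)          # "iris"
    spark_disconnect(sc)  # the session, and its temporary views, go away

    sc <- spark_connect(master = "local")
    src_tbls(sc)          # character(0) -- nothing survived the old session
    spark_disconnect(sc)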