Tags: apache-spark, caching, apache-spark-sql, rdd

When will Spark clean the cached RDDs automatically?


The RDDs that have been cached with the rdd.cache() method from the Scala shell are stored in memory.

That means they consume part of the RAM available to the Spark process itself.

Given that RAM is limited, if more and more RDDs are cached, when will Spark automatically clean the memory occupied by the RDD cache?
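For context, a minimal spark-shell sketch of the scenario described above (the file paths are hypothetical):

```scala
// In the spark-shell, the SparkContext is already available as sc.
val rdd1 = sc.textFile("events-2023.log").cache()
val rdd2 = sc.textFile("events-2024.log").cache()

// Actions materialize the RDDs; their partitions now occupy
// part of the executor memory reserved for storage.
rdd1.count()
rdd2.count()
```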


Solution

Spark will automatically unpersist/clean an RDD or DataFrame when it is no longer used: once a cached RDD goes out of scope and is garbage-collected on the driver, Spark's ContextCleaner removes its cached blocks, and when storage memory runs low, Spark evicts cached partitions in least-recently-used (LRU) order to make room. To check whether an RDD is cached, open the Spark UI, go to the Storage tab, and look at the memory details.
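Besides the Spark UI, cache status can also be inspected from the shell; a small sketch, assuming an RDD named rdd as in the question:

```scala
// Storage level of a specific RDD; StorageLevel.NONE means it is not cached.
println(rdd.getStorageLevel)

// All RDDs currently marked as persistent in this SparkContext, keyed by RDD id.
sc.getPersistentRDDs.foreach { case (id, r) =>
  println(s"RDD $id: ${r.getStorageLevel}")
}
```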

From the shell, we can use rdd.unpersist() or sqlContext.uncacheTable("sparktable") to remove an RDD or a cached table from memory manually.
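A sketch of those manual cleanup calls (the table name "sparktable" is taken from the answer above):

```scala
// Drop the RDD's cached partitions from memory.
rdd.unpersist()

// For tables cached via sqlContext.cacheTable(...), uncacheTable
// frees the corresponding in-memory columnar data.
sqlContext.uncacheTable("sparktable")
```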

Keep in mind that Spark is built around lazy evaluation: unless and until an action is called, it does not load or process any data into the RDD or DataFrame.
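Because of that laziness, even cache() itself has no immediate effect; a sketch with a hypothetical input file:

```scala
// cache() only sets the storage level; no data is stored yet.
val logs = sc.textFile("logs.txt").cache()

// Still nothing materialized: map is a transformation.
val lengths = logs.map(_.length)

// The first action computes the lineage and fills the cache.
println(lengths.sum())
```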