An RDD is inherently fault-tolerant thanks to its lineage. But if an application has hundreds of operations, reconstructing lost data by replaying all of them could get difficult. Is there a way to store the intermediate data? I understand there are the persist()/cache() options to hold RDDs, but are they good enough to hold the intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the storage level when checkpointing an RDD (like MEMORY or DISK, etc.)?
While cache() and persist() are generic RDD operations, checkpointing is mostly associated with streaming, although plain RDDs can be checkpointed as well.
caching - the RDD is stored with the default storage level (MEMORY_ONLY for plain RDDs)
rdd.cache()
persist - you can choose the storage level yourself, i.e. whether the data should be kept in memory, on disk, or both
rdd.persist(StorageLevel.MEMORY_AND_DISK)
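As a minimal sketch (the SparkContext sc and the input path are assumptions, not part of the original answer), cache() and an explicit storage level look like this; note that an RDD's storage level cannot be changed once it has been set:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")  // hypothetical input path
val words = lines.flatMap(_.split(" "))

words.cache()                                      // same as persist(StorageLevel.MEMORY_ONLY)
words.count()                                      // first action materializes and stores the partitions

val lengths = words.map(_.length)                  // a separate RDD, since a storage level can only be set once
lengths.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk whatever does not fit in memory
lengths.count()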
checkpoint - you need to specify a directory where the data will be saved (in reliable storage such as HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
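Checkpointing also works on a plain RDD outside streaming, by setting a checkpoint directory on the SparkContext. Note that checkpoint() does not take a storage level; the data is always written as files under the checkpoint directory. A rough sketch (sc, the directory, and rawRdd are assumptions for illustration):

sc.setCheckpointDir("hdfs:///checkpoints")   // reliable storage such as HDFS/S3
val cleaned = rawRdd.filter(_.nonEmpty)      // rawRdd: some previously built RDD[String], assumed here
cleaned.checkpoint()                         // mark the RDD for checkpointing
cleaned.count()                              // an action triggers the actual write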
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and/or on disk. But the lineage of the RDD (that is, the sequence of operations that generated it) is still remembered, so that if there are node failures and parts of the cached RDD are lost, they can be regenerated.
However, checkpoint saves the RDD to files in HDFS AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault-tolerant through replication).
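One way to see the difference, as a small sketch (the SparkContext sc and the checkpoint directory are assumptions), is to compare the lineage reported by toDebugString before and after checkpointing:

sc.setCheckpointDir("hdfs:///checkpoints")                        // assumed checkpoint location

val rdd = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)

println(rdd.toDebugString)   // full lineage: filter <- map <- parallelize
rdd.checkpoint()
rdd.count()                  // the action triggers the checkpoint write
println(rdd.toDebugString)   // lineage now bottoms out at a ReliableCheckpointRDD; earlier steps are dropped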