I am trying to understand the role of checkpoints in Spark batch applications.
While I understand that checkpoints are critical in Structured Streaming, I have also seen batch applications using them in order to "truncate lineage":
dbutils.fs.rm('/user/checkpoint/stg_my_checkpoint', True)
spark.sparkContext.setCheckpointDir('/user/checkpoint/stg_my_checkpoint')
and after a few transformations:
df = df.checkpoint()
I read that by truncating the lineage, you free up memory resources in situations where you no longer need the RDD or want to avoid unnecessary recomputation of the RDD's partitions.
But I see that there are also risks involved (loss of fault tolerance, recomputation overhead).
I am wondering how one knows that the RDD is no longer needed.
As the article states, the prime situation for RDD lineage truncation is iterative processing for ML. We know this from data scientists using PySpark in practice as well. The article explains it very well:
The key takeaway from this part is the lineage reduction by the checkpoint operation. It's also the most visible difference between checkpoint and cache. On the surface both "materialize" the dataset somehow, but checkpoint does it without the lineage. NB: Note the advanced points on eager and lazy checkpoints.
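A minimal sketch of that difference, assuming a SparkSession named spark and a writable checkpoint directory (the /tmp paths are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/checkpoints')  # placeholder path

df = spark.range(1_000_000).withColumn('squared', F.col('id') * F.col('id'))

# cache(): materializes the data on first action, but the full lineage is kept,
# so evicted or lost partitions can still be recomputed from it.
cached = df.cache()

# checkpoint(eager=True): runs a job right away, writes the data to the
# checkpoint directory, and returns a DataFrame whose plan no longer carries
# the upstream lineage.
eager_cp = df.checkpoint(eager=True)

# checkpoint(eager=False): a lazy checkpoint; the write only happens when the
# first action on the returned DataFrame runs.
lazy_cp = df.checkpoint(eager=False)

# Comparing the physical plans makes the truncation visible.
cached.explain()
eager_cp.explain()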
The other use case is a long chain of transformations in a batch Spark application.
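This ties back to the iterative ML case above. A sketch of that pattern, assuming a SparkSession named spark; the loop body, the checkpoint interval, and the paths are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/checkpoints')  # placeholder path

df = spark.range(100_000).withColumn('value', F.rand())

CHECKPOINT_EVERY = 10  # illustrative interval, tune for the workload

for i in range(50):
    # Each iteration stacks another transformation onto the lineage / query plan.
    df = df.withColumn('value', F.col('value') * 0.9 + F.lit(i))

    if (i + 1) % CHECKPOINT_EVERY == 0:
        # Materialize the current state and drop the accumulated lineage, so later
        # actions or failure recovery do not replay every previous iteration.
        df = df.checkpoint()

df.write.mode('overwrite').parquet('/tmp/output')  # placeholder output path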
More generally though, from https://sparkbyexamples.com/spark/what-is-dag-in-spark/ :
Spark achieves fault tolerance using the DAG by using a technique called lineage, which is the record of the transformations that were used to create an RDD. When a partition of an RDD is lost due to a node failure, Spark can use the lineage to rebuild the lost partition.
The lineage is built up as the DAG is constructed, and Spark uses it to recover from any failures during the job execution. When a node fails, the RDD partitions that were stored on that node are lost, and Spark uses the lineage to recompute the lost partitions. Spark rebuilds the lost partitions by re-executing the transformations that were used to create the RDD.
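You can inspect that lineage record yourself via the RDD API. A small sketch, assuming a SparkSession named spark and a placeholder checkpoint directory; toDebugString() shows the chain of transformations before and after a checkpoint:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/checkpoints')  # placeholder path
sc = spark.sparkContext

rdd = (sc.parallelize(range(1000))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# Full lineage: parallelize -> map -> filter. This is what Spark replays to
# rebuild a lost partition.
print(rdd.toDebugString().decode())

rdd.checkpoint()  # mark the RDD for checkpointing
rdd.count()       # an action triggers the actual write to the checkpoint directory

# After the checkpoint, the lineage is rooted at the checkpointed data instead
# of the original chain of transformations.
print(rdd.toDebugString().decode())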
To achieve fault tolerance, Spark uses two mechanisms:
A. RDD Persistence: When an RDD is marked as “persistent,” Spark will keep its partition data in memory or on disk, depending on the storage level used. This ensures that if a node fails, Spark can rebuild the lost partitions from the persisted data, rather than recomputing the entire RDD.
B. Checkpointing: Checkpointing is a mechanism to periodically save the RDDs to a stable storage like HDFS. This mechanism reduces the amount of recomputation required in case of failures. In case of a node failure, the RDDs can be reconstructed from the latest checkpoint and their lineage. NB: Note local checkpointing as well, which is faster but less resilient; see the sketch after this list.
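A sketch of both mechanisms on the DataFrame API, assuming a SparkSession named spark; the paths are placeholders, and in practice the checkpoint directory would sit on HDFS or similar stable storage:

from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/checkpoints')  # stable storage in practice

df = spark.range(100_000).withColumn('value', F.rand())

# A. Persistence: keep partition data in memory, spilling to disk if needed;
#    the lineage is retained alongside the persisted data.
persisted = df.persist(StorageLevel.MEMORY_AND_DISK)
persisted.count()  # an action materializes the persisted data

# B. Checkpointing: write the data to the checkpoint directory and truncate the lineage.
checkpointed = df.checkpoint()

# Local checkpointing: writes to executor-local storage instead of the reliable
# checkpoint directory; faster, but the data is lost if an executor dies.
locally_checkpointed = df.localCheckpoint()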