apache-spark, pyspark, apache-spark-sql, spark-checkpoint

DataFrame Checkpoint Example in PySpark


I read about checkpointing and it looks great for my needs, but I couldn't find a good example of how to use it.
My questions are:

  1. Should I specify the checkpoint dir? Is it possible to do it like this:

    df.checkpoint()

  2. Are there any optional params that I should be aware about?

  3. Is there a default checkpoint dir, or must I specify one myself?

  4. When I checkpoint a dataframe and reuse it - does it automatically read the data from the directory where the files were written?

It would be great if you could share an example of using checkpoint in PySpark with some explanation. Thanks!


Solution

  • You should assign the checkpointed dataframe to a variable, because checkpoint "Returns a checkpointed version of this Dataset" (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.checkpoint.html). So

    df = df.checkpoint()
    

    The only parameter is eager, which dictates whether you want the checkpoint to trigger an action and be saved immediately; it is True by default, and you usually want to keep it that way.
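    For example, a lazy checkpoint (a minimal sketch; df is any DataFrame) is only materialized by the first action that runs on it:

        # eager=False: the checkpoint files are only written once an action runs
        df = df.checkpoint(eager=False)
        df.count()  # this first action materializes the checkpoint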

    You have to set the checkpoint directory with SparkContext.setCheckpointDir(dirName) somewhere in your script before using checkpoints. Alternatively, if you want to save to memory instead, you can use localCheckpoint() instead of checkpoint(), but that is not reliable: in case of issues or after termination the checkpoints will be lost (it should be faster, though, as it uses the caching subsystem instead of only writing to disk).
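    Putting it together, a minimal end-to-end sketch (the checkpoint path and the sample data are placeholders) might look like this:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()

        # Must be set before any checkpoint() call; on a cluster this should be
        # a reliable path (e.g. HDFS or S3) visible to all executors.
        spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

        df = spark.range(10000).selectExpr("id", "id % 3 AS value")

        # Eager checkpoint: truncates the lineage and writes the files now.
        df = df.checkpoint()

        # Executor-memory variant: faster, but lost on failure/termination.
        # df = df.localCheckpoint()

        # Later reuse reads from the checkpoint files, not the original lineage.
        df.groupBy("value").count().show()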

    And yes, it should be read automatically. You can look at the history server, and there should be "load data" nodes (I don't remember the exact name) at the start of blocks/queries.
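    If you would rather verify from code than from the UI, df.explain() is a quick check: after a checkpoint the physical plan typically starts from a Scan ExistingRDD node instead of the original transformations (exact node names vary between Spark versions):

        df.explain()
        # after checkpoint() the plan begins with something like:
        # *(1) Scan ExistingRDD[id#0L,value#4L]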