Search code examples
apache-sparkpysparkapache-zeppelin

PySpark get checkpoint directory (version < 3.1.0)


We can set the checkpoint directory path in PySpark using the code below:

spark.sparkContext.setCheckpointDir('/checkpoints')

As SparkContext.getCheckpointDir() is only introduced in PySpark version 3.1.0, how to get the checkpoint directory path using an older version PySpark like v2.4.3 ?


Solution

  • SparkContext.getCheckpointDir() is only implemented in PySpark version 3.1.0, but luckily it was already implemented in the underlying Scala codebase in v2.4.3. You can see that here.

    You can access the underlying sparksession (JavaSparkContext) with the _jsc attribute. The following works in a pyspark REPL on version 2.4.5:

    >>> spark.sparkContext.setCheckpointDir('/checkpoints')
    >>> sc._jsc.sc().getCheckpointDir().get()
    'file:/checkpoints/1829fbb4-0b7b-44c5-b275-50276d063565'