Spark Invalid Checkpoint Directory

I have a long-running iteration in my program, and I want to cache and checkpoint every few iterations (a technique suggested on the web for cutting a long lineage) so that I don't get a StackOverflowError. I do it like this:

for (i <- 2 to 100) {
      //cache and checkpoint every 30 iterations
      if (i % 30 == 0) {
        //I use numEdges in order to start the transformation I need
      //graphs are stored to a list
      //here I use the graph of previous iteration to this iteration
      //and perform a transformation

I have set the checkpoint directory like this:

val sc = new SparkContext(conf)

However, when I run my program I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory

I use 3 computers, each running Ubuntu 14.04, with a pre-built version of Spark 1.4.1 for Hadoop 2.4 or later installed on each.


  • If you have already set up HDFS on a cluster of nodes, you can find your HDFS address in core-site.xml, located in the directory HADOOP_HOME/etc/hadoop. It is the value of the fs.defaultFS property (fs.default.name in older configurations). For me, core-site.xml is set up as:


    Then you can create a directory on HDFS to hold the RDD checkpoint files. Let's name this directory RddCheckPoint and create it with the Hadoop filesystem shell:

    $ hadoop fs -mkdir /RddCheckPoint

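    If you use pyspark, after the SparkContext is initialized with sc = SparkContext(conf=conf), you can set the checkpoint directory by passing the full HDFS path of that directory (the address from core-site.xml followed by /RddCheckPoint) to sc.setCheckpointDir.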


    Once an RDD has been checkpointed, the checkpoint files are saved in the HDFS directory RddCheckPoint. To have a look:

    $ hadoop fs -ls /RddCheckPoint
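
Putting the steps above together, here is a rough sketch in Python of how the pieces might fit. It is not the original poster's code. The helper names (default_fs, checkpoint_uri, run_with_checkpoints) and the address hdfs://namenode-host:9000 are made up for illustration; substitute the value from your own core-site.xml. The only Spark calls used are sc.setCheckpointDir, rdd.cache, rdd.checkpoint and rdd.count.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content. The host and port are placeholders,
# not the values from the answer above.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
"""


def default_fs(core_site_xml):
    """Return the fs.defaultFS value from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def checkpoint_uri(fs_uri, directory):
    """Join the filesystem address and an absolute HDFS directory path."""
    return fs_uri.rstrip("/") + "/" + directory.lstrip("/")


def run_with_checkpoints(sc, rdd, step, iterations, every, checkpoint_dir):
    """Apply `step` repeatedly, cutting the lineage every `every` iterations."""
    sc.setCheckpointDir(checkpoint_dir)
    for i in range(1, iterations + 1):
        rdd = step(rdd)
        if i % every == 0:
            rdd.cache()       # keep the data so the checkpoint job does not recompute it
            rdd.checkpoint()  # only marks the RDD; nothing is written yet
            rdd.count()       # an action, which triggers the actual checkpoint
    return rdd


# Usage with a real SparkContext (needs pyspark and a reachable HDFS):
#   uri = checkpoint_uri(default_fs(SAMPLE_CORE_SITE), "/RddCheckPoint")
#   result = run_with_checkpoints(sc, rdd, lambda r: r.map(lambda x: x + 1),
#                                 100, 30, uri)
```

The action after checkpoint() matters: checkpoint() has to be called before any job has run on that RDD, and the files only show up in /RddCheckPoint once a job has executed.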