Tags: apache-spark, spark-graphx

Spark Invalid Checkpoint Directory


I have a long-running iterative program, and I want to cache and checkpoint every few iterations (a technique suggested on the web for cutting long lineages) so that I won't get a StackOverflowError. I do this:

for (i <- 2 to 100) {
      // cache and checkpoint every 30 iterations
      if (i % 30 == 0) {
        graph.cache()
        graph.checkpoint()
        // numEdges is an action; I use it to trigger the transformations I need
        graph.numEdges
      }
      // graphs are stored in a list;
      // here I use the previous iteration's graph
      // and perform a transformation
}

and I have set the checkpoint directory like this:

val sc = new SparkContext(conf)
sc.setCheckpointDir("checkpoints/")

However, when I finally run my program, I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory

I use 3 computers, each running Ubuntu 14.04, and on each I use a pre-built version of Spark 1.4.1 built for Hadoop 2.4 or later.


Solution

  • If you have already set up HDFS on a cluster of nodes, you can find your HDFS address in core-site.xml, located in the directory $HADOOP_HOME/etc/hadoop. For me, core-site.xml is set up as:

    <configuration>
          <property>
               <name>fs.default.name</name>
               <value>hdfs://master:9000</value>
          </property>
    </configuration>
    

    Then you can create a directory on HDFS to save the RDD checkpoint files. Let's name this directory RddCheckPoint and create it with the Hadoop HDFS shell:

    $ hadoop fs -mkdir /RddCheckPoint
    

    If you use pyspark, after the SparkContext is initialized with sc = SparkContext(conf=conf), you can set the checkpoint directory with

    sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")
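
    For completeness, a minimal pyspark sketch of the whole flow; the RDD here is just a placeholder for illustration:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("CheckpointExample")
    sc = SparkContext(conf=conf)
    sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")

    # a small placeholder RDD, just to demonstrate checkpointing
    rdd = sc.parallelize(range(100))
    rdd.checkpoint()
    rdd.count()  # an action forces evaluation, which writes the checkpoint files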

    Once an RDD is checkpointed, you can see its checkpoint files saved in the HDFS directory RddCheckPoint. To have a look:

    $ hadoop fs -ls /RddCheckPoint
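
    Back in the original Scala program, the fix is then simply to point setCheckpointDir at this HDFS directory instead of a relative local path. A minimal sketch of the corrected setup (conf and graph as in the question, HDFS address taken from the core-site.xml above):

    val sc = new SparkContext(conf)
    // use the full HDFS URI instead of a relative local path like "checkpoints/"
    sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")

    for (i <- 2 to 100) {
      if (i % 30 == 0) {
        graph.cache()      // cache first so the checkpoint does not recompute the graph
        graph.checkpoint()
        graph.numEdges     // an action materializes the graph and writes the checkpoint files
      }
      // ... transformation on the previous iteration's graph, as before ...
    }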