Search code examples
scalaapache-sparkemramazon-emrgraphframes

Spark AWS emr checkpoint location


I'm running a spark job on EMR but need to create a checkpoint. I tried using s3 but got this error message

17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception: 
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020

Here is my sample code

...
val sparkConf = new SparkConf().setAppName("spark-job")
  .set("spark.default.parallelism", (CPU * 3).toString)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
  .set("spark.dynamicAllocation.enabled", "true")


implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....

How can I checkpoint on AWS EMR?


Solution

  • There is a fix in the master branch for this to allow checkpoint to s3 too. I was able to build against it and it worked so this should be part of next release.