Tags: serialization, apache-spark-sql, case-class

Case Class serialization in Spark


In a Spark app (Spark 2.1), I'm trying to pass a case class as an input parameter to a function that is meant to run on executors:

import org.apache.spark.sql.SparkSession

object TestJob extends App {

  val appName = "TestJob"
  val out = "out"
  val p = Params("my-driver-string")

  val spark = SparkSession.builder()
    .appName(appName)
    .getOrCreate()
  import spark.implicits._

  (1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
    .write
    .option("header", "true")
    .csv(out)
}

object Dummy {

  def process(i: Int, v: Params): Vector[String] = {
    Vector(if (i % 2 == 1) v + "_odd" else v + "_even")
  }
}

case class Params(v: String)

When I run it with master local[*] everything works fine, but when running on a cluster the state of the Params instance is not serialized and the output is null_even null_odd ...

Could you please help me understand what I'm doing wrong?


Solution

  • Googling around, I found this post that gave me the solution: Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster

    In the end, the problem is caused by extending App. An object that extends App relies on DelayedInit, so its body (including the initialization of p) only runs inside main on the driver; when the object is referenced on the executors, p is still null, which is exactly why the output is null_even / null_odd. Defining an explicit main method instead of extending App fixes it.
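
    For reference, here is a minimal sketch of the fix (not taken from the linked post): the same job, but with an explicit main method instead of extends App, so that p is a plain local value captured and serialized with the flatMap closure. Params and Dummy are unchanged from the question.

import org.apache.spark.sql.SparkSession

object TestJob {

  def main(args: Array[String]): Unit = {
    val appName = "TestJob"
    val out = "out"
    // p is a local value here, so it is serialized into the closure
    // instead of living in a never-initialized App field on the executors
    val p = Params("my-driver-string")

    val spark = SparkSession.builder()
      .appName(appName)
      .getOrCreate()
    import spark.implicits._

    (1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
      .write
      .option("header", "true")
      .csv(out)
  }
}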