Search code examples
scalaapache-sparkkryo

Registering Classes with Kryo via SparkSession in Spark 2+


I'm migrating from Spark 1.6 to 2.3.

I need to register custom classes with Kryo. So what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

The problem is... everywhere else in Spark 2+ instructions, it indicates that SparkSession is the way to go for everything... and if you need SparkContext it should be through spark.sparkContext and not as a stand-alone val.

So now I use the following (and have wiped any trace of conf, sc, etc. from my code)...

val spark = SparkSession.builder.appName("myApp").getOrCreate()

My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?

I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization

I have a pretty extensive conf.json to set spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes here.

When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession

It makes me think I can do something like the following to augment my spark-defaults.conf:

val spark = 
  SparkSession
    .builder
    .appName("myApp")
    .config("spark.kryo.classesToRegister", "???")
    .getOrCreate()

But what is??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this use.

Would it be:

.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")

or

.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")

or something else?

EDIT

when I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3") i never get any error messages no matter what I put in the string any,any2,any3.

I tried making any each of the following formats

  • "org.myorg.myapp.myclass"
  • "myclass"
  • "class org.myorg.myapp.myclass"

I can't tell if any of these successfully registered anything.


Solution

  • Have you tried the following, it should work since it actually a part of the SparkConf API and I think the only thing missing is that you just need to plug it into the SparkSession:

      private lazy val sparkConf = new SparkConf()
        .setAppName("spark_basic_rdd").setMaster("local[*]").registerKryoClasses(...)
      private lazy val sparkSession = SparkSession.builder()
        .config(sparkConf).getOrCreate()
    

    And if you need a Spark Context you can call: private lazy val sparkContext: SparkContext = sparkSession.sparkContext