scala · apache-spark · serialization · kryo · scala-pickling

Using Scala Pickling serialization in Apache Spark instead of KryoSerializer and JavaSerializer


While searching for the best serialization techniques for Apache Spark, I found the link below, https://github.com/scala/pickling#scalapickling, which states that serialization in Scala will be faster and more automatic with this framework.

And Scala Pickling claims several advantages (ref: https://github.com/scala/pickling#what-makes-it-different).

So I wanted to know whether Scala Pickling (a PickleSerializer) can be used in Apache Spark instead of KryoSerializer.

  • If yes, what changes are necessary? (An example would be helpful.)
  • If no, why not? (Please explain.)

Thanks in advance. And forgive me if I am wrong.

Note: I am using Scala to write my Apache Spark (version 1.4.1) application.
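
For context, this is roughly how I enable Kryo today (a minimal sketch; the config key and serializer class name are standard Spark settings, while `MyRecord` is just a placeholder for my own classes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder record type standing in for my application classes.
case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo avoid writing full class names per object.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)
```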


Solution

  • I visited Databricks for a couple of months in 2014 to try to incorporate a PicklingSerializer into Spark, but I couldn't find a way to get the type information that scala/pickling needs into Spark without changing Spark's interfaces, which was a no-go at the time. For example, RDDs would need to carry Pickler[T] type information in their interface for scala/pickling's generation mechanism to kick in; a hypothetical sketch follows.
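
    To make the interface problem concrete, here is a purely hypothetical sketch of what RDD signatures would have needed to look like. This is not real Spark code, and `PicklingRDD` is an invented name; actual RDD methods carry only a `ClassTag[T]`:

    ```scala
    import scala.pickling.{Pickler, Unpickler}
    import scala.reflect.ClassTag

    // Hypothetical: every RDD-like type and transformation would have to
    // thread Pickler/Unpickler evidence through its signature so that
    // scala/pickling's compile-time generation could kick in. That
    // interface change was the no-go.
    abstract class PicklingRDD[T: ClassTag: Pickler: Unpickler] {
      def map[U: ClassTag: Pickler: Unpickler](f: T => U): PicklingRDD[U]
    }
    ```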

    All of that changed, though, with Spark 2.0.0: if you use Datasets or DataFrames, you get so-called Encoders, which are even more specialized than scala/pickling.

    Use Datasets in Spark 2.x; they are much more performant on the serialization front than plain RDDs.
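
    A minimal Spark 2.x sketch of the Encoder approach (`Person` and the app name are placeholders; the `spark.implicits._` import derives an `Encoder` for case classes at compile time):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Placeholder case class; Spark derives an Encoder[Person] for it.
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder()
      .appName("encoder-example")
      .master("local[*]")  // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._  // brings Encoder[Person] into scope

    // Rows are stored and shuffled in Spark's compact binary format via the
    // Encoder, rather than going through generic Java or Kryo serialization.
    val ds = spark.createDataset(Seq(Person("Ada", 36), Person("Grace", 45)))
    ds.filter(_.age > 40).show()
    ```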