Search code examples
javaapache-sparkkryo

Spark Serialization strategy - should I be using Kryo exclusively?


I'm new to spark. And even newer to Kryo. In my spark app, I'm using kryo to serialize value objects but just use the Serializable interface for objects that house algorithms...the reason was that I didn't want to register every single class with Kryo.

Should I be using kryo exclusively? Is mixing & matching ok (like what I'm doing)?


Solution

  • When you set spark.serializer to org.apache.spark.serializer.KryoSerializer all objects inside RDDs (it doesn't cover the closures*) are serialized using Kryo. Class registration is only a way to improve performance (registered classes require only an integer id not a fully qualified class name to be stored with the serialized object). You can check relevant section of the Kryo documentation for details.

    In other words if you care about performance you should register all classes which have to be serialized in your program but one way or another you already use Kryo.


    * Closures are serialized using standard Java serialization and registration in Kryo doesn't affect that so if some objects are to be passed via closure you still have to use java.io.Serializable.