Search code examples
apache-sparkapache-spark-datasetencoderkryo

Spark dataset encoders: kryo() vs bean()


While working with datasets in Spark, we need to specify Encoders for serializing and de-serializing objects. We have option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>).

How are these different and what are the performance implications of using one vs another?


Solution

  • It is always advisable to use Kryo Serialization to Java Serialization for many reasons. Some of them are below.

    • Kryo Serialization is faster than Java Serialization.
    • Kryo Serialization uses less memory footprint especially, in the cases when you may need to Cache() and Persist(). This is very helpful during the phases like Shuffling.
    • Though Kryo is supported for caching and shuffling it is not supported during persistence to the disk.
    • saveAsObjectFile on RDD and objectFile method on SparkContext supports only java serialization.
    • The more Custom Data Types you are handling in your datasets the more complexity it is to handle them. Therefore, It is usually the best practice to use a uniform serialization like Kryo.
    • Java’s serialization framework is notoriously inefficient, consuming too much CPU, RAM and size to be a suitable large scale serialization format.
    • Java Serialization needs to store the fully qualified class names while serializing objects.But, Kryo lets you avoid this by saving/registering the classes sparkConf.registerKryoClasses(Array( classOf[A], classOf[B], ...)) or sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator"). Which saves a lot of space and avoids unnecessary metadata.

    Difference between the bean() and javaSerialization() is javaSerialization serializes objects of type T using generic java serialization. This encoder maps T into a single byte array (binary) field. Where as bean creates an encoder for Java Bean of type T. Both of them uses Java Serialization the only difference is how they represent the objects into bytes.

    Quoting from the documentation

    JavaSerialization is extremely inefficient and should only be used as the last resort.