While working with Datasets in Spark, we need to specify encoders for serializing and deserializing objects. We have the option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>). How are these different, and what are the performance implications of using one versus the other?
It is always advisable to prefer Kryo serialization over Java serialization, for several reasons. Some of them are below:

- Kryo is faster and more compact than Java serialization. The smaller footprint matters whenever data is materialized, for example when you cache() or persist() a dataset, and it is very helpful during phases like shuffling.
- Kryo is not supported everywhere: saveAsObjectFile on RDD and the objectFile method on SparkContext support only Java serialization.
- With Kryo you can register your classes up front, either via sparkConf.registerKryoClasses(Array(classOf[A], classOf[B], ...)) or via sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator"). Registration saves a lot of space and avoids writing unnecessary class metadata with every record.

The difference between bean() and javaSerialization() is that javaSerialization serializes objects of type T using generic Java serialization and maps T into a single byte-array (binary) field, whereas bean creates an encoder for a Java Bean of type T and maps each bean property to its own column in Spark's internal representation. So only javaSerialization actually uses Java serialization; the two encoders differ in how they represent objects as bytes.
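The schema difference is easy to see in a small sketch (assuming Spark is on the classpath and a local SparkSession; the Person bean is hypothetical): Encoders.bean yields one column per bean property, while Encoders.kryo collapses the whole object into a single binary column.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;

public class EncoderComparison {
    // Hypothetical Java Bean: no-arg constructor plus getters/setters,
    // which is what Encoders.bean needs to discover the properties.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("encoders").getOrCreate();

        Encoder<Person> beanEnc = Encoders.bean(Person.class);
        Encoder<Person> kryoEnc = Encoders.kryo(Person.class);

        Dataset<Person> asBean =
                spark.createDataset(Arrays.asList(new Person("a", 1)), beanEnc);
        Dataset<Person> asKryo =
                spark.createDataset(Arrays.asList(new Person("a", 1)), kryoEnc);

        // bean encoder: one column per property (age, name)
        asBean.printSchema();
        // kryo encoder: the whole object in one binary "value" column
        asKryo.printSchema();

        spark.stop();
    }
}
```

Because the bean encoder keeps fields as separate columns, Spark can select or filter on individual fields without deserializing whole objects; with the kryo (or javaSerialization) encoder the data is opaque to the query optimizer.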
Quoting from the documentation:

"JavaSerialization is extremely inefficient and should only be used as the last resort."
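That inefficiency, and the "single byte array field" representation, can be demonstrated with plain Java serialization, no Spark required (a self-contained sketch; the Point class is hypothetical). The stream carries the full class name, field descriptors, and stream magic on top of the 8 bytes of actual data:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSerDemo {
    // Hypothetical payload: only 8 bytes of real data (two ints).
    public static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x, y;
        public Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Generic Java serialization: maps the whole object into one byte array,
    // which is how javaSerialization-style encoders represent a value.
    public static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new Point(1, 2));
        // Far larger than 8 bytes: the rest is class metadata.
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}
```

Kryo avoids most of this per-record overhead, especially once classes are registered: registration lets Kryo write a small integer id instead of the full class name, which is the space saving mentioned above.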