When using Kryo, it's generally recommended that you register the classes you intend to serialize so the class name doesn't need to be included in the serialized data.
But in a class hierarchy, the actual implementation class may not be obvious. For example, if I have a Spark dataset that contains Vector objects, those objects' concrete class may be either DenseVector or SparseVector.
When I register the classes with Kryo, should I:
Bonus question: if the Vector appears as a field in a tuple or case class, would you also need to register the product (Tuple2[Vector, Int] for example)?
The answer is... no 2 :) In other words:
Unfortunately, I have no documentation reference to back this up right now (I know it from experience).
1 There is a special case, though, when you can register only an abstract
class for the purposes of serialization/deserialization (not for Kryo.copy()
though). This case is when:
Look at the ImmutableListSerializer
by Martin Grotzke. In the registerSerializers
method, he registers only ImmutableList
class for the purposes of serialization/deserialization because:
ImmutableList.copyOf()
takes care of returning the proper subclass.If the Vector
appears in a tuple or case class, you need to register the appropriate class (e.g. Tuple2
).
Note that generic types do not matter here as long as you serialize using Kryo.writeClassAndObject
(e.g. ImmutableListSerializer
extends Serializer<ImmutableList<Object>>
).