Search code examples
apache-sparkkryo

Do you need to register all implementations of an interface with Kryo?


When using Kryo, it's generally recommended that you register the classes you intend to serialize so the class name doesn't need to be included in the serialized data.

But in a class hierarchy, the actual implementation class may not be obvious. For example, if I have a Spark dataset that contains Vector objects, those objects' concrete class may be either DenseVector or SparseVector.

When I register the classes with Kryo, should I:

  1. Register the class according to the dataset's declared type (Vector)
  2. Register the concrete classes (DenseVector and SparseVector)
  3. All of the above, just in case?

Bonus question: if the Vector appears as a field in a tuple or case class, would you also need to register the product (Tuple2[Vector, Int] for example)?


Solution

  • Answer to main question

    The answer is... no 2 :) In other words:

    • you need to register the concrete classes only, and
    • you need to register every single concrete class that you may encounter1.

    Unfortunately, I have no documentation reference to back this up right now (I know it from experience).


    1 There is a special case, though, when you can register only an abstract class for the purposes of serialization/deserialization (not for Kryo.copy() though). This case is when:

    1. your serialization is the same for all subclasses, and
    2. during deserialization, you can decide which subclass to return based on the data.

    Look at the ImmutableListSerializer by Martin Grotzke. In the registerSerializers method, he registers only ImmutableList class for the purposes of serialization/deserialization because:

    1. serialization is the same, and
    2. during deserialization, ImmutableList.copyOf() takes care of returning the proper subclass.

    Answer to bonus question

    If the Vector appears in a tuple or case class, you need to register the appropriate class (e.g. Tuple2).

    Note that generic types do not matter here as long as you serialize using Kryo.writeClassAndObject (e.g. ImmutableListSerializer extends Serializer<ImmutableList<Object>>).