Yaml serialization through camel: using base class load/dump and accessing type(self) in decorator

TL;DR: how to use type(self) in the decorator of a member function?

I would like to serialize derived classes in Python and share some of the serialization logic in the base class. Since pickle and simple yaml did not seem to be able to deal with this reliably, I stumbled over camel, which I consider quite a neat solution to the problem.

Consider two extremely simplified classes B and A, where B inherits from A. I want to be able to serialize B in my main function like this:

from camel import Camel, CamelRegistry
serializable_types = CamelRegistry()

# ... define A and B with dump and load functions ...

if __name__ == "__main__":
    serialization_interface = Camel([serializable_types])
    b = B(x=3, y=4)
    s = serialization_interface.dump(b)

I came up with two solutions that work:

Version 1: the dumping and loading is done in stand-alone functions outside of the class. Problems: not very elegant, function dumpA not automatically available to inheriting class in dumpB, more cumbersome function naming, function scope bigger than necessary

# VERSION 1 - dump and load in external functions
class A:

    def __init__(self, x):
        self._x = x

@serializable_types.dumper(A, 'object_A', version=None)
def dumpA(a):
    return {'x': a._x}

@serializable_types.loader('object_A', version=None)
def loadA(data, version):
    return A(data.x)

class B(A):

    def __init__(self, x, y):
        super().__init__(x)
        self._y = y

@serializable_types.dumper(B, 'object_B', version=None)
def dumpB(b):
    b_data = dumpA(b)
    b_data.update({'y': b._y})
    return b_data

@serializable_types.loader('object_B', version=None)
def loadB(data, version):
    return B(data.x)

Version 2: the functions for loading and dumping are defined directly in the constructor. The functions are still not available in the subclass :/

# VERSION 2 - dump and load functions defined in constructor
class A:

    def __init__(self, x):
        self._x = x

        @serializable_types.dumper(A, 'object_A', version=None)
        def dump(a):
            return a.to_dict()

        @serializable_types.loader('object_A', version=None)
        def load(data, version):
            return A(data.x)

    def to_dict(self):
        return {'x': self._x}

class B(A):

    def __init__(self, x, y):
        super().__init__(x)
        self._y = y

        @serializable_types.dumper(B, 'object_B', version=None)
        def dump(b):
            b_data = b.to_dict()
            return b_data

        @serializable_types.loader('object_B', version=None)
        def load(data, version):
            return B(data.x)

    def to_dict(self):
        b_data = super().to_dict()
        b_data.update({'y': self._y})
        return b_data

I would like to achieve an implementation that looks like this:

# VERSION 3 - dump and load functions are member functions
# ERROR: name 'A' is not defined
class A:

    def __init__(self, x):
        self._x = x

    @serializable_types.dumper(A, 'object_A', version=None)
    def dump(a):
        return {'x': a._x}

    @serializable_types.loader('object_A', version=None)
    def load(data, version):
        return A(data.x)

class B(A):

    def __init__(self, x, y):
        super().__init__(x)
        self._y = y

    @serializable_types.dumper(B, 'object_B', version=None)
    def dump(b):
        b_data = super().dump(b)
        b_data.update({'y': b._y})
        return b_data

    @serializable_types.loader('object_B', version=None)
    def load(data, version):
        return B(data.x)

This will not work because A and B are not yet defined at the point where the dump functions are defined. From a software design perspective, however, I consider this to be the cleanest solution with the fewest lines of code.
Is there a way to get the type definitions of A and B to work in the decorator? Or has anyone solved the problem in a different way? I came across this but couldn't see a straightforward way of applying it to my use case.


  • Your version 3 is not going to work because, as you probably noticed, at the time the decorator is called, A is not defined yet.

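    If you write your decorator the way it had to be written before the @ syntactic sugar was added to Python, so that instead of:

    def some_decorator(fun):
        return fun
    @some_decorator
    def xyz():
        pass

    you write:

    def some_decorator(fun):
        return fun
    def xyz():
        pass
    xyz = some_decorator(xyz)

    then that should be immediately clear: the decorator is an ordinary call that runs while the class body is being executed, before the name A is bound.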
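    To see the timing concretely, here is a small sketch with a stand-in decorator factory (register is a hypothetical name, not part of camel). The argument A is looked up while the class body is still executing, so the lookup fails:

```python
def register(cls):
    """Hypothetical stand-in for a decorator factory such as a registry's dumper()."""
    def decorator(func):
        return func
    return decorator


error = None
try:
    class A:
        # The decorator expression is evaluated while the class body runs,
        # i.e. before the finished class object is bound to the name A.
        @register(A)
        def dump(self):
            return {}
except NameError as exc:
    error = exc

print(type(error).__name__, error)
```

    The same lookup succeeds as soon as the class statement has completed, which is why registering outside the class body works.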

    Your version 2 defers the registration of your loader and dumper routines until an instance of both A and B has been created in some other way than by loading. That could work if you created instances of both classes and then did a dump, followed by a load, from within one program. But if you only create B and want to dump it, then the functions for A have not been registered and A.dump() is not available. And anyway, if a program both dumps and loads data, it is much more common to load from some persistent storage first and dump afterwards, and during loading the registration would not yet have taken place. So you would need some extra registration mechanism for all your classes, plus the creation of at least one instance of each of these classes. Probably not what you want.
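    The ordering problem can be reproduced without camel. In this sketch a plain dict (loaders, a hypothetical stand-in for the registry's internal table) is filled from inside __init__, so nothing is registered until the first instance exists:

```python
loaders = {}  # hypothetical stand-in for the registry's tag -> loader table


class A:
    def __init__(self, x):
        self._x = x

        # Registration runs only when __init__ runs, i.e. once per instance.
        def load(data):
            return A(data['x'])
        loaders['object_A'] = load


registered_before = 'object_A' in loaders   # no instance created yet
A(1)
registered_after = 'object_A' in loaders    # __init__ has now run

print(registered_before, registered_after)
```

    A program that starts by loading stored data is in the "before" state, so it would not find a loader for the tag.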

    In version 1, you cannot easily find dumpA while in dumpB. It should be possible to look into the internals of serializable_types and find the parent class of B, but that is non-trivial and ugly, and there is a better way: reduce dumpB (and dumpA) to functions that return the value returned by some method of B (resp. A), appropriately named dump:

    from camel import CamelRegistry, Camel
    serializable_types = CamelRegistry()
    # VERSION 1 - dump and load in external functions
    class A:
        def __init__(self, x):
            self._x = x
        def dump(self):
            return {'x': self._x}
    @serializable_types.dumper(A, 'object_A', version=None)
    def dumpA(a):
        return a.dump()
    @serializable_types.loader('object_A', version=None)
    def loadA(data, version):
        return A(data['x'])  # data is the mapping produced by the dumper
    class B(A):
        def __init__(self, x, y):
            super().__init__(x)  # sets self._x, which A.dump() needs
            self._y = y
        def dump(self):
            b_data = A.dump(self)
            b_data.update({'y': self._y})
            return b_data
    @serializable_types.dumper(B, 'object_B', version=None)
    def dumpB(b):
        return b.dump()
    @serializable_types.loader('object_B', version=None)
    def loadB(data, version):
        return B(data['x'], data['y'])  # B's __init__ needs both values
    if __name__ == "__main__":
        serialization_interface = Camel([serializable_types])
        b = B(x=3, y=4)
        s = serialization_interface.dump(b)
        print(s)

    which gives:

    !object_B
    x: 3
    y: 4

    That works because by the time dumpB is called, you have an instance of type B (otherwise you could not get at its attributes), and the methods of class B know about class A.

    Please note that doing return B(data.x) is not going to work in any of your versions as B's __init__ expects two parameters.
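    If you want dump and load to live in the classes while still registering by type, one option is to register after the class statement has finished, for example with a class decorator. The sketch below does not use camel: MiniRegistry is a hypothetical stand-in that only mimics the decorator-style dumper/loader calls shown above, and serializable and the load classmethod are names introduced for illustration.

```python
class MiniRegistry:
    """Hypothetical stand-in for a registry with dumper()/loader() decorators."""

    def __init__(self):
        self.dumpers = {}   # class -> (tag, function)
        self.loaders = {}   # tag -> function

    def dumper(self, cls, tag, version=None):
        def decorator(func):
            self.dumpers[cls] = (tag, func)
            return func
        return decorator

    def loader(self, tag, version=None):
        def decorator(func):
            self.loaders[tag] = func
            return func
        return decorator


registry = MiniRegistry()


def serializable(tag):
    """Class decorator: it runs after the class exists, so cls is available."""
    def decorator(cls):
        registry.dumper(cls, tag, version=None)(lambda obj: obj.dump())
        registry.loader(tag, version=None)(lambda data, version: cls.load(data))
        return cls
    return decorator


@serializable('object_A')
class A:
    def __init__(self, x):
        self._x = x

    def dump(self):
        return {'x': self._x}

    @classmethod
    def load(cls, data):
        return cls(data['x'])


@serializable('object_B')
class B(A):
    def __init__(self, x, y):
        super().__init__(x)
        self._y = y

    def dump(self):
        data = super().dump()   # reuse the base-class logic
        data['y'] = self._y
        return data

    @classmethod
    def load(cls, data):
        return cls(data['x'], data['y'])


# Round trip through the stand-in registry.
tag, dump_func = registry.dumpers[B]
data = dump_func(B(x=3, y=4))
restored = registry.loaders[tag](data, None)
print(tag, data, type(restored).__name__)
```

    With camel, the two registry calls inside serializable would presumably be made on serializable_types instead; that substitution is an assumption and has not been tested here.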

    I find the above rather unreadable.

    You indicate that "simple yaml did not seem to be able to deal with this reliably". I am not aware of why this would be true, but there is a lot of misunderstanding about YAML.¹

    I recommend you take a look at ruamel.yaml (disclaimer: I am the author of that package). It requires registration of classes for dumping and loading, uses pre-defined method names for loading and dumping (from_yaml resp. to_yaml), and the "registration office" calls these methods including class information. So there is no need to defer the definition of these methods until you construct an object as in your version 2.

    You can either explicitly register a class or decorate the class as soon as the decorator is available (i.e. once you have your YAML instance). Since B inherits from A, you only have to provide to_yaml and from_yaml in A, and you can re-use the dump methods from the previous example:

    import sys
    class A:
        yaml_tag = u'!object_A'
        def __init__(self, x):
            self._x = x
        @classmethod
        def to_yaml(cls, representer, node):
            return representer.represent_mapping(cls.yaml_tag, cls.dump(node))
        @classmethod
        def from_yaml(cls, constructor, node):
            instance = cls.__new__(cls)
            yield instance
            state = ruamel.yaml.constructor.SafeConstructor.construct_mapping(
                  constructor, node, deep=True)
            instance.__dict__.update(state)  # attributes are named after the YAML keys
        def dump(self):
            return {'x': self._x}
    import ruamel.yaml  # delayed import so A cannot be decorated
    yaml = ruamel.yaml.YAML()
    @yaml.register_class
    class B(A):
        yaml_tag = u'!object_B'
        def __init__(self, x, y):
            super().__init__(x)
            self._y = y
        def dump(self):
            b_data = A.dump(self)
            b_data.update({'y': self._y})
            return b_data
    yaml.register_class(A)
    # B not registered, because it is already decorated
    b = B(x=3, y=4)
    yaml.dump(b, sys.stdout)
    print('=' * 20)
    b = yaml.load("""\
    !object_B
    x: 42
    y: 196
    """)
    print('b.x: {.x}, b.y: {.y}'.format(b, b))

    which gives:

    !object_B
    x: 3
    y: 4
    ====================
    b.x: 42, b.y: 196

    The yield in the above code is necessary to deal with instances that have (indirect) circular references to themselves and for which, obviously, not all arguments can be available at the time of object creation.
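    The two-step pattern can be illustrated without any YAML library. In this sketch (plain Python; Node, construct_node and build are hypothetical names, not ruamel.yaml internals), the generator first yields an empty instance so that it can already be referenced, and only fills in its state when the generator is resumed:

```python
class Node:
    pass


def construct_node(state_factory):
    """Yield an empty instance first, fill in its attributes on resumption."""
    instance = Node.__new__(Node)
    yield instance                                      # step 1: bare object
    instance.__dict__.update(state_factory(instance))   # step 2: fill state


def build():
    # The state refers back to the object under construction (a cycle),
    # which is only possible because the object exists before its state.
    gen = construct_node(lambda me: {'name': 'root', 'parent': me})
    obj = next(gen)      # bare instance, no attributes yet
    for _ in gen:        # resume the generator so it runs to completion
        pass
    return obj


node = build()
print(node.name, node.parent is node)
```

    A constructor that needed every argument up front could not build such an object, because one of the arguments is the object itself.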

    ¹ E.g. one YAML 1.2 reference states that a YAML document begins with ---, where that is actually called a directives-end-marker and not a document-start-marker, for good reasons. It also states that ..., the document-end-marker, can only be followed by directives or ---, whereas the spec clearly indicates that it can be followed by comments and also by bare documents.