Tags: c#, serialization, protobuf-net, datacontractserializer

Understanding the concept of 'serializing by reference'


I'm writing my own binary serializer optimized for game development. So far it's fully functional: it emits IL to generate the [de]serialization methods for a sequence of types known in advance. The only missing feature is serializing things by reference; everything is currently serialized by value.

In order to implement it, I have to understand it first, and that is what I'm finding a bit tricky. Let me show what I've understood so far with a couple of examples:

Example 1 (as seen here):

public class Person
{
    public string Name;
    public Person Friend;
}

static void Main(string[] args)
{
    Person p1 = new Person();
    p1.Name = "John";

    Person p2 = new Person();
    p2.Name = "Mike";

    p1.Friend = p2;

    Person[] group = new Person[] { p1, p2 };

    var serializer = new DataContractSerializer(group.GetType(), null, 
        0x7FFF /*maxItemsInObjectGraph*/, 
        false /*ignoreExtensionDataObject*/, 
        true /*preserveObjectReferences : this is where the magic happens */, 
        null /*dataContractSurrogate*/);

    serializer.WriteObject(Console.OpenStandardOutput(), group);
}

This one I understand completely. We have a root object, the array, which references two unique persons. p1.Friend happens to be p2, so instead of serializing p1.Friend by value we just store an id that points to p2, which we've already serialized.

However, have a look at this second example:

    static void Example2()
    {
        var p1 = new Person() { Name = "Diablo" };
        var p2 = new Person() { Name = "Mephesto" };

        p1.Friend = p2;

        var serializer = new DataContractSerializer(typeof(Person), null, 0x7FFF, false, true, null);

        serializer.WriteObject(Console.OpenStandardOutput(), p1);
        Console.WriteLine("\n");
        serializer.WriteObject(Console.OpenStandardOutput(), p2);
    }

Now, according to my understanding: when serializing p1, the serializer will serialize p1.Name and p1.Friend. By the second WriteObject call, the serializer has already serialized p2 (which is p1.Friend), so it should just write an id that points to p1.Friend instead of serializing it by value.

Running the code and looking at the output, that doesn't seem to be the case. In the second output the serializer writes p2 by value, as if it hasn't come across it yet, and that is the part I don't get. It's as if there's an internal id counter that gets reset at the end of WriteObject.

Here's another similar example:

    static void Example3()
    {
        var p1 = new Person() { Name = "Diablo" };
        var p2 = p1;

        var serializer = new DataContractSerializer(typeof(Person), null, 0x7FFF, false, true, null);

        serializer.WriteObject(Console.OpenStandardOutput(), p1);
        Console.WriteLine("\n");
        serializer.WriteObject(Console.OpenStandardOutput(), p2);
    }

Again, the second output shows that we're serializing p2 as if we haven't encountered a definition for it yet.

Note that I didn't choose DataContractSerializer for any particular reason; any serializer that supports serializing by reference would do.

I tried inspecting DataContractSerializer with ILSpy, but I got lost quickly and couldn't figure out much.

  1. In Example2, why didn't the serializer store an id to p1.Friend when serializing p2? Is 'serializing by reference' only applied within a single object hierarchy, or how does it work in general?
  2. It seems to me that serializing by reference automatically handles circular references (A <-> B). Is that correct, or do I need to do other things to make sure I won't fall into an infinite loop?
  3. I assume serializing by reference makes sense only for reference types and not value types, correct?

I've tagged protobuf-net because it's similar in that it's a binary serializer and emits IL. I would love to hear how serializing by reference is implemented there :p


Solution

    1. Each call to WriteObject is a separate serialization context; the reference-tracking is not preserved between calls
    2. As long as you correctly identify previously seen values, it shouldn't recurse indefinitely, but a depth check can help avoid issues
    3. Correct, although you could attempt to recognise semantically identical value types if you wanted (perhaps via the structural equality interface)
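
Points 1 and 2 can be illustrated with a minimal sketch. This is not how DataContractSerializer or protobuf-net is implemented; `IdentityComparer`, `WriteContext`, `SketchSerializer`, the text output format and the `MaxDepth` value are all hypothetical names and choices made for illustration. The only idea it relies on is an id table keyed by object identity that lives for exactly one top-level write call.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;

public class Person
{
    public string Name;
    public Person Friend;
}

// Compares objects by identity, ignoring any Equals/GetHashCode overrides.
sealed class IdentityComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y) { return ReferenceEquals(x, y); }
    int IEqualityComparer<object>.GetHashCode(object obj) { return RuntimeHelpers.GetHashCode(obj); }
}

// Hypothetical per-call context: created when a write starts, discarded when it ends.
sealed class WriteContext
{
    private readonly Dictionary<object, int> ids =
        new Dictionary<object, int>(new IdentityComparer());

    // Returns true if obj was already seen in this context.
    // Either way, id is the number assigned to obj.
    public bool TryGetExisting(object obj, out int id)
    {
        if (ids.TryGetValue(obj, out id)) return true;
        id = ids.Count + 1;
        ids.Add(obj, id);
        return false;
    }
}

static class SketchSerializer
{
    const int MaxDepth = 64; // arbitrary guard value, an assumption of this sketch

    // Plays the role of WriteObject: every call starts with an empty id table.
    public static string Write(Person root)
    {
        var sb = new StringBuilder();
        WritePerson(root, new WriteContext(), sb, 0);
        return sb.ToString();
    }

    static void WritePerson(Person p, WriteContext ctx, StringBuilder sb, int depth)
    {
        if (depth > MaxDepth) throw new InvalidOperationException("Object graph too deep.");
        if (p == null) { sb.Append("null"); return; }

        int id;
        if (ctx.TryGetExisting(p, out id))
        {
            sb.Append("ref:" + id); // already written in this call: emit only the id
            return;
        }

        sb.Append("{id:" + id + ",Name:" + p.Name + ",Friend:");
        WritePerson(p.Friend, ctx, sb, depth + 1);
        sb.Append("}");
    }
}
```

With two persons that reference each other (Diablo's Friend is Mephesto and vice versa), `SketchSerializer.Write(diablo)` returns `{id:1,Name:Diablo,Friend:{id:2,Name:Mephesto,Friend:ref:1}}`: the cycle ends at the back-reference. A second call, `SketchSerializer.Write(mephesto)`, starts from an empty table and writes Mephesto in full again with id 1, which matches the behaviour seen in Example2 and Example3.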

    Additional thought: if you apply this to strings, you might want to special-case them to use value equality rather than reference equality; there's no point serialising two different instances (references) of the same string
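
The string special case could look roughly like the following. `StringTable` is a hypothetical helper, not part of any library; it assumes the caller handles null strings separately and that ordinal comparison is the desired notion of "same string".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical string table: keyed by string content, not by object identity,
// so two distinct instances with the same characters share one id.
sealed class StringTable
{
    private readonly Dictionary<string, int> ids =
        new Dictionary<string, int>(StringComparer.Ordinal);

    // Returns true if an equal string was already seen.
    // Either way, id is the number assigned to that content.
    // Null is not supported here; the caller is assumed to write nulls separately.
    public bool TryGetExisting(string s, out int id)
    {
        if (ids.TryGetValue(s, out id)) return true;
        id = ids.Count + 1;
        ids.Add(s, id);
        return false;
    }
}
```

Given `string a = "Diablo";` and `string b = new string("Diablo".ToCharArray());`, the two variables hold different references, yet the first `TryGetExisting(a, out id)` returns false and the following `TryGetExisting(b, out id)` returns true with the same id. An identity-keyed table would have treated them as two separate values.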