c++ serialization boost-serialization

boost::serialization of one instance per unique ID


I'm trying to boost::serialize structures which point to objects (say, of symbol class) implementing an idea of a single-instance-per-unique-<something>. That means, those objects are not created directly, but using a static method symbol::get(). This method retrieves an existing object from some global dictionary or creates a new object if necessary.

Now the hard part is that in my system I have many large structures with pointers to such symbols. The structures do not all fit in memory at the same time, so I need to build, preprocess and serialize each structure separately, one after another. Later I'll deserialize and process structures on request.

The standard boost::serialization methods, namely load(...) and save(...), don't work here: deserializing a structure would lose the system-wide uniqueness of symbols, and serialization would waste a lot of space (my symbol objects are quite large). I've read the Boost docs and found that for nonstandard constructors I can use the save_construct_data and load_construct_data functions. But the docs also say that the default load_construct_data "just uses the default constructor to initialize previously allocated memory". So again, this isn't it.

The question is: how can I bypass this loading function so I can avoid any allocation and use my symbol::get() instead? Or maybe there is a more elegant solution?

EDIT: attached a simple code demonstrating the problem.

struct structure_element {
};

class symbol : public structure_element {
  symbol(string x);
  // Global dictionary: one symbol per unique string.
  static map<string, symbol> known_symbols;
public:
  // Returns the existing symbol for x, creating it on first use.
  static symbol *get(string x) {
    auto it = known_symbols.find(x);
    if (it == known_symbols.end()) {
      it = known_symbols.emplace(x, symbol(x)).first;
    }
    return &it->second;
  }
};

class structure_node : public structure_element {
  set<symbol *> some_attributes;
  vector<structure_element *> children;
};

Solution

  • In general, exceptional cases like this can implement load_construct_data themselves (which means you no longer rely on the default implementation, as you already observed in your question).

    More specifically: use Boost Flyweight. Or look at how they implemented serialization for inspiration.

    Without a concrete sample program I'm unable to demonstrate things for you.

    Filling in some of the blanks, here's a demo program that should give a feel for things:

    #include <iostream>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>
    #include <boost/serialization/set.hpp>
    #include <boost/serialization/map.hpp>
    
    #if 0
    #   define DEMO_FLYWEIGHT
    #   include <boost/flyweight/serialize.hpp>
    #   include <boost/flyweight.hpp>
    #endif
    
    struct structure_element { 
        virtual ~structure_element() {}
    
      private:
        friend class boost::serialization::access;
        template <typename Ar> void serialize(Ar& /*ar*/, unsigned /*version*/) {
        }
    };
    
    namespace detail {
        struct symbol_impl {
            symbol_impl(std::string const& x) : _x(x) { }
    
    #ifdef DEMO_FLYWEIGHT
            size_t hash() const { return boost::hash_value(_x); }
            //bool operator< (symbol_impl const& other) const { return _x <  other._x; }
            bool operator==(symbol_impl const& other) const { return _x == other._x; }
    #endif
    
          private:
            std::string _x;
    
            friend class boost::serialization::access;
            template <typename Ar> void serialize(Ar& ar, unsigned /*version*/) {
                ar & _x;
            }
        };
    }
    
    #ifdef DEMO_FLYWEIGHT
    namespace boost {
        template <> struct hash<::detail::symbol_impl> {
            size_t operator()(::detail::symbol_impl const& s) const { return s.hash(); }
        };
    }
    #endif
    
    struct symbol : public structure_element {
        symbol(std::string const& x) : _impl(x) {}
    
      private:
    #ifdef DEMO_FLYWEIGHT
        boost::flyweight<detail::symbol_impl> _impl;
    #else
        detail::symbol_impl _impl;
    #endif
    
        friend class boost::serialization::access;
        template <typename Ar> void serialize(Ar& ar, unsigned /*version*/) {
            ar & boost::serialization::base_object<structure_element>(*this);
            ar & _impl;
        }
    };
    
    struct structure_node : public structure_element {
        structure_node(std::set<symbol*> a, std::vector<structure_element*> c) 
            : some_attributes(std::move(a)), children(std::move(c))
        {
        }
    
        // TODO value semantics/ownership
      private:
        std::set<symbol *> some_attributes;
        std::vector<structure_element *> children;
    
        friend class boost::serialization::access;
        template <typename Ar> void serialize(Ar& ar, unsigned /*version*/) {
            ar & boost::serialization::base_object<structure_element>(*this);
            ar & some_attributes;
            ar & children;
        }
    };
    
    #include <boost/make_shared.hpp>
    
    int main() {
        // everything is leaked, by design
        symbol* bar = new symbol("bar");
    
        structure_node data { 
            {
                new symbol("foo"),
                bar,
                new symbol("foo"),
                new symbol("foo"),
                bar,
            },
            { 
                bar,
            }
        };
    
        boost::archive::text_oarchive oa(std::cout);
        oa << data;
    }
    

    Notes:

    • Live On Coliru without flyweight

      22 serialization::archive 11 0 0 1 0
      0 0 0 4 0 3 1 0
      1
      2 0 0 3 bar 3
      3
      4 3 foo 3
      5
      6 3 foo 3
      7
      8 3 foo 0 0 1 0 3 1
      
    • Live On Coliru with flyweight enabled

      22 serialization::archive 11 0 0 1 0
      0 0 0 4 0 3 1 0
      1
      2 0 0 0 0 0 3 bar 3
      3
      4 1 3 foo 3
      5
      6 1 3
      7
      8 1 0 0 1 0 3 1
      

    Note how objects are already tracked when serializing through a pointer. This implies that no duplicates are serialized even when not using flyweight; see, e.g., the bar object, which is used 3 times but written only once.

    For the foo object, you can see that its implementation is "deduplicated", if you will, when using flyweight.

    Boost Flyweight is highly configurable and can be made to perform significantly better than the default. See the library documentation if you want to learn more.
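
As a rough illustration of the underlying idea (write only the unique key of each symbol, and resolve keys through symbol::get() when reading), here is a minimal sketch that uses plain standard streams instead of Boost archives. Everything in it is an assumption made for illustration: the name() accessor, the save_symbols/load_symbols helpers, and the whitespace-separated text format are hypothetical and not part of the question's code or of Boost. With Boost.Serialization the same split could presumably live in the save/load member functions of the owning structure (declared with BOOST_SERIALIZATION_SPLIT_MEMBER), but that is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's symbol class.
class symbol {
    std::string name_;
    explicit symbol(std::string n) : name_(std::move(n)) {}
public:
    const std::string& name() const { return name_; }

    // One instance per unique name. std::map never relocates its elements,
    // so the returned pointers stay valid as the registry grows.
    static symbol* get(const std::string& n) {
        static std::map<std::string, symbol> known;
        auto it = known.find(n);
        if (it == known.end())
            it = known.emplace(n, symbol(n)).first;
        return &it->second;
    }
};

// Save only the key of each symbol, never the (possibly large) object.
// Assumption: keys contain no whitespace.
void save_symbols(std::ostream& os, const std::vector<symbol*>& syms) {
    os << syms.size();
    for (const symbol* s : syms) os << ' ' << s->name();
    os << '\n';
}

// Read the keys back and resolve each one through symbol::get(),
// so an already known symbol is reused instead of being allocated again.
std::vector<symbol*> load_symbols(std::istream& is) {
    std::size_t n = 0;
    is >> n;
    std::vector<symbol*> out;
    for (std::size_t i = 0; i < n; ++i) {
        std::string key;
        is >> key;
        assert(is && "truncated input");
        out.push_back(symbol::get(key));
    }
    return out;
}
```

Loading a structure this way keeps the system-wide uniqueness: a key that was already interned maps to the same pointer as before, and each symbol's data is stored once in the registry rather than in every serialized structure.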