Search code examples
c++serializationboostgraphstack-overflow

How to prevent stack overflow when serializing a recursive graph structure with boost::serialization?


I am trying to serialize a large (geometric) graph structure with the boost serialization library.

I store my graph as an adjacency list, that is, my structure is as follows:

class Node {
  double x,y;
  std::vector<Node*> adjacent_nodes;
  ...
}

class Graph {
  std::vector<Node*> nodes;
  ...
}

Now with > 10k nodes my problem is, that when I start to serialize (save) my graph, it will recursively call the serialization of all of those nodes before returning, since the graph is connected.

To be more precise, when serializing the Graph it will start by serializing the first node in the "nodes" vector. While doing so it needs to serialize "adjacent_nodes" of the first nodes, where e.g. the second node is contained.

Therefore it needs to serialize the second node before returning the serialization of the first node and so on.

I found this thread from 2010, where someone explained the exact same problem. However, they did not come to a working solution there.

Any help would be greatly appreciated.

My structure in more detail:

class Node {

    double x,y;
    std::vector<Node*> adjacent_nodes;

public:

    inline double get_x() const { return x; }
    inline double get_y() const { return y; }
    inline std::vector<Node*> const& get_adjacent_nodes() const { return adjacent_nodes; }

    Node (double x, double y):x(x),y(y) {}

    void add_adjacent(Node* other) {
        adjacent_nodes.push_back(other);
    }

private:

    Node() {}

  friend class boost::serialization::access;
  template <class Archive>
  void serialize(Archive &ar, const unsigned int) {
    ar & x;
        ar & y;
        ar & adjacent_nodes;
  }

};

class Simple_graph {

std::vector<Node*> nodes;

void add_edge(int firstIndex, int secondIndex) {
    nodes[firstIndex]->add_adjacent(nodes[secondIndex]);
    nodes[secondIndex]->add_adjacent(nodes[firstIndex]);
}

public:

/* methods to get the distance of points, to read in the nodes, and to generate edges */

~Simple_graph() {
    for (auto node: nodes) {
        delete node;
    }
}

private:

  friend class boost::serialization::access;
  template <class Archive>
  void serialize(Archive &ar, const unsigned int) {
    ar & nodes;
  }

};

Edit: To add some suggestions made in the above mentioned thread, citing Dominique Devienne:

1) save all the nodes without their topology info on a first pass of the vector, thus recording all the "tracked" pointers for them, then write the topology info for each, since then you don't "recurse", you only write a "ref" to an already serialized pointer.

2) have the possibility to write a "weak reference" to a pointer, which only adds the pointer to the "tracking" map with a special flag saying it wasn't "really" written yet, such that writing the topology of a node that wasn't yet written is akin to "forward references" to those neighboring nodes. Either the node will really be written later on, or it never will, and I suppose serialization should handle that gracefully.

#1 doesn't require changes in boost serialization, but puts the onus on the client code. Especially since you have to "externally" save the neighbors, so it's no longer well encapsulated, and writing a subset of the surface's nodes become more complex.

#2 would require seeking ahead to read the actual object when encountering a forward reference, and furthermore a separate map to know where to seek for it. That may be incompatible with boost serialization (I confess to be mostly ignorant about it).

Can any of those proposals be implemented by now?


Solution

  • Since you already have a vector with pointers to all your nodes, you can serialize the adjacent_nodes vector using indexes instead of serializing the actual node data.

    You'll need to convert the this pointer to an index when serializing a node. This is simplest if you can store the node index in the node, otherwise you'll have to search thru nodes to find the right pointer (this process can be sped up by creating some sort of associative container to map the pointer to the index).

    When you need to read in the data, you can create your initial nodes vector filled with pointers to empty/dummy nodes (which will get populated when they are serialized).

    If that's not feasible, you can load the node indexes into a temporary array, then go back and populate the pointers once all the nodes have been read in. But you won't have to seek or re-read any parts of your file.