Search code examples
c++cgraphbigdatabdd

Representation of Large Graph with 100 million nodes in C++


I am dealing with a very large graph with 500 million number of nodes and the average degree of nodes is 100. So it is kind of sparse graph. I also have to store the weightage of each edge. I am currently using two vectors for that like following

// V could be 100 million
vector<int> *AdjList = new vector<int>[V];
vector<int> *Weight  = new vector<int>[V];

Using vector of vector does not seem to be space efficient. It takes more than 400gb of storage. Is there any better space efficient way to store this large graph in memory? Any suggestion of using any c++ library?


Solution

  • Preliminary remarks

    You could think of using vectors of vectors instead of using dynamic memory allocation:

    vector<vector<int>> AdjList(V);
    

    In any case, you'll have V different vector<int> in your adjacency list. Every vector needs some space overhead to manage the size and the location of its items. Unfortunately you double this overhead (and associated hidden memory management when adding new links) by keeping weight in a different vector/array.

    So why not regroup the adjacency list and the weight ?

    struct Link {  
       int target;   // node number that was in adj list.  Hope none is negative!!
       int weight;   
    };
    vector<vector<Link>> AdjList(V);
    

    Is the structure sparse ?

    If the big majority of nodes has some kind of link, this is quite fine.

    If on contrary, many nodes don't have an outgoing link (or if you have large unused node id ranges) then you could consider:

    map<int, vector<Link>> AdjList;  
    

    The map is an associative array. There would be only vectors for nodes that have outgoing links. By the way, you could use any numbering scheme you want for your nodes, even negative ones.

    You could even go a step further, and use a double map. The first map gives you the outgoing nodes. The second map maps the target node to the weight:

    map<int, map<int, int>> Oulala; 
    

    But this risks to be much more memory intensive.

    Big volumes ?

    map and vector manage memory dynamically using a default allocator. But you have lots of small objects of predetermined size. So you could consider using your own allocator. This could optimize memory management overhead significantly.

    Also, if you use vectors, when you load the adjacency list of a new node, it could be efficient to immediately reserve the size for the vector (if you know it). This could avoid several successive reallocations for the vector's growth. With millions of node this could be very expensive.

    Libraries ?

    The search for third party libraries is out of scope on SO. But if the above tips are not sufficient, you could consider using an existing graph library such as for example:

    There are a couple of other graph libraries around, but many seem either no longer maintained or not designed for big volumes.