Search code examples
pythongraphnetworkxinference

Efficient storage and use of large sets of graphs in Python w/ networkX, the aim being inference across the graphs?


The goal is as follows: I have the set of nodes, size N, for which I would like to

  • generate (as an iterable) a set of all possible directed graphs (presumably using networkx.DiGraph() and numpy.arange(0,N))(edit: to clarify — this is a very large number of graphs for even small N, so some kind of iterable structure is needed which allows the set to be constructed on the fly)
  • filter this set of graphs based on their meeting some criteria (e.g., the existence of a complete path from node 0 to node N-1)
  • store parameter values associated with each edge
  • compute functions across the filtered sets of graphs, taking into account the edge-associated parameter values in those functions without needing to have all the graphs in memory at the same time (as that would be really taxing without parallelization, and I don't know how to handle parallelization yet)

It seems like there has been plenty of discussion of dealing efficiently with graphs with many nodes and edges (i.e., large graphs) but very little discussion of how to effectively handle many graphs at once (large sets of graphs) where each graph can be expected to not overtax the system.

Edit: Ideally this would also work well with pandas and numpy arrays, but I'm pretty sure that any solution to the above can also be made to work with these since fundamentally networkX is working over dictionaries of dictionaries.


Solution

  • You write that you'd like to generate all possible directed graphs with N nodes. Unfortunately, this isn't feasible even for small N.

    Given a set of N nodes, the number of possible undirected edges is N(N-1)/2. We can define a graph by selecting a subset of these edges. There are 2^(N*(N-1)/2) possible subsets, meaning that there are exactly that many possible undirected graphs on N nodes.

    Suppose N=10. There are roughly 3.5 * 10^13 possible graphs on these nodes. If you could handle one million graphs per second, you'd need roughly 10^7 seconds handle all of the graphs. This is on the order of a year.

    And that's just undirected graphs. There are more directed graphs. For N nodes, there are 2^(N*(N-1)) directed graphs. Here's a table that demonstrates how fast this blows up. |V| is the number of nodes:

    |V|   Number of Digraphs
    ===   ==================
    1     1
    2     4
    3     64
    4     4096
    5     1048576
    6     1073741824
    7     4398046511104
    8     72057594037927936
    9     4722366482869645213696
    

    If you'd really like to do this, here are python generators which will lazily enumerate graphs on a node set:

    from itertools import chain
    import networkx as nx
    
    def power_set(iterable):
        """Return an iterator over the power set of the finite iterable."""
        s = list(iterable)
        return chain.from_iterable(combinations(s, n) for n in xrange(len(s) + 1))
    
    def enumerate_graphs(nodes):
        # create a list of possible edges
        possible_edges = list(combinations(nodes, 2))
    
        # create a graph for each possible set of edges
        for edge_set in power_set(possible_edges):
            g = nx.Graph()
            g.add_nodes_from(nodes)
            g.add_edges_from(edge_set)
            yield g
    
    def enumerate_digraphs(nodes):
        # create a list of possible edges
        possible_edges = list(combinations(nodes, 2))
    
        # generate each edge set
        for edge_set in power_set(possible_edges):
            # for each set of `M` edges there are `M^2` directed graphs
            for swaps in power_set(xrange(len(edge_set))):
                directed_edge_set = list(edge_set)
                for swap in swaps:
                    u,v = directed_edge_set[swap]
                    directed_edge_set[swap] = v,u
                g = nx.DiGraph()
                g.add_nodes_from(nodes)
                g.add_edges_from(directed_edge_set)
                yield g
    

    We can then plot all of the directed graphs thusly:

    nodes = ("apples", "pears", "oranges")
    
    digraphs = list(enumerate_digraphs(nodes))
    layout = nx.random_layout(digraphs[0])
    
    plt.figure(figsize=(20,30))
    for i,g in enumerate(digraphs):
        plt.subplot(6,5,i+1)
        nx.draw_networkx(g, pos=layout)
        plt.gca().get_xaxis().set_visible(False)
        plt.gca().get_yaxis().set_visible(False)
    

    Enumeration of digraphs