Tags: python-3.x, networkx, graph-theory

Fast way to find all connected subgraphs of given size in Python?


[Note: A fast solution was given in an answer below; however, further improvement in speed is desirable.]

Given is an undirected, sparsely connected graph G with n vertices. I am looking for a fast way to find all connected subgraphs of G with m vertices. It can be assumed that m ≪ n and that the vertex degrees satisfy deg(v) ≪ n.

Graphical example

For the example we have n=5, m=3.

The 4 connected subgraphs are: (0,1,2),(1,2,3),(0,2,3),(2,3,4)
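
For reference, the example graph can be built explicitly in networkx; the edge list below is inferred from the adjacency sets used in the answer further down, so it is a reconstruction rather than the original figure:

import networkx as nx

# edges inferred from the answer's adjacency list [{1,2},{0,2},{0,1,3},{2,4},{3}]
G_example = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])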

Code example

The following code, using networkx and ego_graph, is quite slow: it takes about 100 seconds for a graph with n=100, deg(v)=10, m=4, which is an example of m ≪ n and deg(v) ≪ n. The case n=10000 would take practically forever.

import networkx as nx
import itertools
import time

n = 100   # nodes in graph
deg = 10  # degree parameter; the graph gets n*deg edges
m = 4     # nodes in subgraph

graph = nx.gnm_random_graph(n, n*deg, seed=1)  # construct random graph

starttime = time.time()
G = graph.copy()
all_connected_subgraphs = []
radius = m - 1
for node in graph.nodes():
    egoG = nx.ego_graph(G, node, radius=radius, center=False)  # all nodes within distance m-1
    for sn in itertools.combinations(egoG, radius):  # test all combinations of nearby nodes
        SG = [node, *sn]
        G_sub = G.subgraph(SG)
        if nx.is_connected(G_sub):
            all_connected_subgraphs.append(SG)
    G.remove_node(node)  # avoid rediscovering subgraphs that contain this node

endtime = time.time()
print(endtime - starttime)

The slow steps are the lines G_sub = G.subgraph(SG) and if nx.is_connected(G_sub):, which consume about 20% and 80% of the total calculation time, respectively.
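
One way to verify this breakdown is to profile the loop with cProfile. A minimal sketch, assuming the loop above has been wrapped in a function (the name run_benchmark is hypothetical):

import cProfile
import itertools
import networkx as nx

def run_benchmark(graph, m):
    # hypothetical wrapper around the loop above, so it can be profiled as one call
    G = graph.copy()
    found = []
    for node in graph.nodes():
        egoG = nx.ego_graph(G, node, radius=m - 1, center=False)
        for sn in itertools.combinations(egoG, m - 1):
            SG = [node, *sn]
            if nx.is_connected(G.subgraph(SG)):
                found.append(SG)
        G.remove_node(node)
    return found

graph = nx.gnm_random_graph(100, 1000, seed=1)
cProfile.run('run_benchmark(graph, 4)', sort='cumulative')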

In a related question, no efficient solution was given.


Solution

  • Here's a solution using a recursive generator function. I didn't use NetworkX; instead, the graph is represented as a list of sets of neighbours.

    def all_connected_subgraphs(g, m):
        def _recurse(t, possible, excluded):
            if len(t) == m:
                yield t
            else:
                excluded = set(excluded)  # local copy, so sibling branches are unaffected
                for i in possible:
                    if i not in excluded:
                        new_t = (*t, i)
                        new_possible = possible | g[i]  # extend the frontier with i's neighbours
                        excluded.add(i)  # never re-add i in later branches (symmetry reduction)
                        yield from _recurse(new_t, new_possible, excluded)
        excluded = set()
        for (i, possible) in enumerate(g):
            excluded.add(i)  # nodes before i are excluded: earlier iterations covered them
            yield from _recurse((i,), possible, excluded)
    

    It's a backtracking algorithm: given a partial subgraph of size < m and a set of nodes that are allowed to be added, it adds each candidate node in turn and expands the set of possible options to include all nodes reachable from that node. To reduce symmetry, we keep track of which nodes have already been used and add them to the excluded set.

    Demonstration using your example graph:

    >>> example_graph = [{1, 2}, {0, 2}, {0, 1, 3}, {2, 4}, {3}]
    >>> list(all_connected_subgraphs(example_graph, 3))
    [(0, 1, 2), (0, 2, 3), (1, 2, 3), (2, 3, 4)]
    

    Here's the code I used to generate a random graph, which should be equivalent to the G(n, m) distribution you're sampling from:

    import random
    import itertools as it
    
    def random_graph(n, num_edges):
        adj_sets = [set() for i in range(n)]
        all_edges = list(it.combinations(range(n), 2))
        random.shuffle(all_edges)
        for (i, j) in all_edges[:num_edges]:
            adj_sets[i].add(j)
            adj_sets[j].add(i)
        return adj_sets
    

    The running time on my laptop is under half a second to find all connected subgraphs of size 4 from a graph with 100 nodes and 1,000 edges, like your benchmark:

    >>> g = random_graph(100, 1000)
    >>> import timeit
    >>> timeit.timeit(lambda: list(all_connected_subgraphs(g, 4)), number=1)
    0.42606337900360813
    

    Note that list is needed here to exhaust the generator; otherwise the algorithm won't actually generate any subgraphs. If you just want to iterate over all connected subgraphs with a for loop, then you don't need to put them in a list first, as in the example below.
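
    For example, to process the subgraphs one at a time without building a list:

    for subgraph in all_connected_subgraphs(example_graph, 3):
        print(subgraph)  # each result is handled as soon as it is generated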

    There are various "easy" ways to optimise this, e.g. the set of edges for each node i can be filtered first to remove "back-edges" to nodes less than i, and the recursion could be replaced with a stack to make an iterative algorithm that doesn't use yield (which can have a relatively large performance overhead); a sketch of that iterative variant is given below. But those optimisations are probably not needed unless your inputs need to be considerably bigger or denser.
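
    For what it's worth, here is a sketch of the stack-based variant, using the same list-of-neighbour-sets representation and collecting results in a list instead of yielding them. It is an illustration, not a tuned implementation; it finds the same subgraphs as the recursive version, possibly in a different order:

    def all_connected_subgraphs_iterative(g, m):
        results = []
        for start in range(len(g)):
            # each stack frame: (partial subgraph, candidate nodes, excluded nodes);
            # nodes <= start are excluded so each subgraph is found exactly once
            stack = [((start,), g[start], set(range(start + 1)))]
            while stack:
                t, possible, excluded = stack.pop()
                if len(t) == m:
                    results.append(t)
                    continue
                excluded = set(excluded)
                for i in possible:
                    if i not in excluded:
                        excluded.add(i)
                        # snapshot the excluded set, mirroring the recursive version
                        stack.append(((*t, i), possible | g[i], set(excluded)))
        return results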