Tags: python, python-3.x, graph, networkx, graph-theory

How to efficiently find pairs of nodes in a graph that when removed will cause the graph to split?


Consider this simple biconnected graph (a graph without articulation points):

import networkx as nx

G = nx.diamond_graph()
nx.draw(G, with_labels=True)

diamond graph
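
For reference, the biconnectedness can be verified directly:

print(nx.is_biconnected(G))             # True: no single node disconnects this graph
print(list(nx.articulation_points(G)))  # []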

I want to find out which pairs of nodes, when removed from the graph, will cause the graph to split. For example, removing nodes 1 and 2 from this graph leaves two components: node 0 and node 3. To do this, I remove each combination of two nodes in turn and count the number of components left in the graph.

from itertools import combinations

breaking_combinations = []
combos = combinations(G.nodes(), 2)
for combo in combos:
    G_copy = G.copy()
    init_num_components = nx.number_connected_components(G_copy)
    G_copy.remove_nodes_from(combo)
    num_components_after_combo_removed = nx.number_connected_components(G_copy)
    if num_components_after_combo_removed > init_num_components:
        breaking_combinations.append(combo)

print(breaking_combinations)

The logic works and gives the expected result in this example: (1, 2). But this process is so slow on big graphs (~10,000 nodes) that checking all combinations takes forever.

My question is: is there a more efficient approach that achieves the same result, perhaps using built-in networkx algorithms?


Solution

  • To expand on my comment above: for a graph without articulation points, removing the pair (u, v) splits the graph exactly when v is an articulation point of the graph with u removed, so it is enough to remove each node in turn and let nx.articulation_points find its partners:

    import networkx as nx
    
    G = nx.diamond_graph()
    
    solutions = []
    for removed_node in G:
        H = G.subgraph([node for node in G if node != removed_node]) # subgraph views are much faster to initialize than copies
        solutions.extend([(removed_node, node) for node in nx.articulation_points(H)])
    print(solutions)
    # [(1, 2), (2, 1)]
    

    Edit

    I also did a little bit of testing. The results are encouraging.

    import networkx as nx
    
    from itertools import combinations
    
    
    def find_articulation_pairs(G):
        solutions = []
        for removed_node in G:
            H = G.subgraph([node for node in G if node != removed_node]) # subgraph views are much faster to initialize than copies
            solutions.extend([(removed_node, node) for node in nx.articulation_points(H)])
        return solutions
    
    
    def find_articulation_pairs_baseline(G):
        # the original approach with minor improvements
        total_components = nx.number_connected_components(G) # no need to compute this number on each iteration
        solutions = []
        for removed_nodes in combinations(G, 2):
            H = G.subgraph([node for node in G if node not in removed_nodes])
            if nx.number_connected_components(H) > total_components:
                solutions.append(removed_nodes)
        return solutions
    
    
    def test_correctness():
        G = nx.diamond_graph()
        print(find_articulation_pairs(G))
        print(find_articulation_pairs_baseline(G))
    
    
    def test_running_time():
        import time
    
        # large random graph
        G = nx.erdos_renyi_graph(100, 0.1)
    
        # largest biconnected subgraph
        H = G.subgraph(max(nx.biconnected_components(G), key=len))
    
        tic = time.time()
        _ = find_articulation_pairs(H)
        toc = time.time()
        _ = find_articulation_pairs_baseline(H)
        tac = time.time()
    
        print(f"Baseline time: {tac-toc:.2f}")
        print(f"Improved time: {toc-tic:.2f}")
    
    
    if __name__ == "__main__":
    
        test_correctness()
        test_running_time()
    
    

    Running the script results in the following output:

    [(1, 2), (2, 1)]
    [(1, 2)]
    Baseline time: 14.72
    Improved time: 0.44
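
    The two result lists agree up to ordering: the improved approach reports each pair in both orders, while the baseline reports each unordered pair once. If unordered pairs are needed, the output can be deduplicated afterwards, for example:

    unique_pairs = {tuple(sorted(pair)) for pair in find_articulation_pairs(nx.diamond_graph())}
    print(unique_pairs)
    # {(1, 2)}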
    

    Edit #2

    Both approaches are trivial to parallelize, so depending on the number of available cores, there is another speed-up of a factor of 4-12 to be had.

    Some back-of-the-envelope calculations suggest that for 10k nodes, the running time of my improved approach (on my machine, without parallelization) would still be around 2 days, so parallelization is probably not optional but a necessity.
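
    For reference, here is a minimal sketch of such a parallelization using the standard library's multiprocessing module. The helper names (find_articulation_pairs_parallel, _pairs_for_node, _init_worker) are just illustrative choices, not networkx API, and each worker keeps its own copy of the graph.

    import networkx as nx
    from multiprocessing import Pool

    _G = None  # per-worker copy of the graph, set once by the pool initializer


    def _init_worker(G):
        global _G
        _G = G


    def _pairs_for_node(removed_node):
        # same per-node step as find_articulation_pairs, run in a worker process
        H = _G.subgraph([node for node in _G if node != removed_node])
        return [(removed_node, node) for node in nx.articulation_points(H)]


    def find_articulation_pairs_parallel(G, processes=None):
        # processes=None lets multiprocessing use all available cores
        with Pool(processes, initializer=_init_worker, initargs=(G,)) as pool:
            chunks = pool.map(_pairs_for_node, list(G))
        return [pair for chunk in chunks for pair in chunk]


    if __name__ == "__main__":
        print(find_articulation_pairs_parallel(nx.diamond_graph()))
        # [(1, 2), (2, 1)]

    Splitting the work into per-node tasks keeps the pickled payload small: the graph is shipped to each worker once via the initializer, and only a single node is sent per task.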