Search code examples
algorithmgraphneo4jconnectivity

Find Minimum Vertex Connected Sub-graph


First of all, I have to admit I'm not good at graph theory.

I have a weakly connected directed graph G=(V,E) where V is about 16 millions and E is about 180 millions.

For a given set S, which is a subset of V (size of S will be around 30), is it possible to find a weakly connected sub-graph G'=(V',E') where S is a subset of V' but try to keep the number of V' and E' as small as possible?

The graph G may change and I hope there's a way to find the sub-graph in real time. (When a process is writing into G, G will be locked, so don't worry about G get changed when your sub-graph calculation is still running.)

My current solution is find the shortest path for each pair of vertex in S and merge those paths to get the sub-graph. The result is OK but the running time is pretty expensive.

Is there a better way to solve this problem?


Solution

  • If you're happy with the results from your current approach, then it's certainly possible to do at least as well a lot faster:

    Assign each vertex in S to a set in a disjoint set data structure: https://en.wikipedia.org/wiki/Disjoint-set_data_structure. Then:

    • Do a breadth-first-search of the graph, starting with S as the root set.
    • When you the search discovers a new vertex, remember its predecessor and assign it to the same set as its predecessor.
    • When you discover an edge that connects two sets, merge the sets and follow the predecessor links to add the connecting path to G'

    Another way to think about doing exactly the same thing:

    1. Sort all the edges in E according to their distance from S. You can use BFS discovery order for this
    2. Use Kruskal's algorithm to generate a spanning tree for G, processing the edges in that order (https://en.wikipedia.org/wiki/Kruskal%27s_algorithm)
    3. Pick a root in S, and remove any subtrees that don't contain a member of S. When you're done, every leaf will be in S.

    This will not necessarily find the smallest possible subgraph, but it will minimize its maximum distance from S.