python graph-theory connected-components

Working out dependent groups

Using the following function I can generate some test data.

import random, string
a = list(string.ascii_lowercase)
def gen_test_data():
    s = []
    for i in xrange(15):
        p = random.randint(1,3)
        xs = [random.choice(a) for i in xrange(p)]
        s.append(xs)
    return s

This is my test data.

[
    ['a', 's'],
    ['f', 'c'],
    ['w', 'z'],
    ['z', 'p'],
    ['z', 'u', 'g'],
    ['v', 'q', 'w'],
    ['y', 'w'],
    ['d', 'x', 'i'],
    ['l', 'f', 's'],
    ['z', 'g'],
    ['h', 'x', 'k'],
    ['b'],
    ['t'],
    ['s', 'd', 'c'],
    ['s', 'w', 'd']
]

If a letter shares the list with another letter it is dependent on that letter, and any of that letters other dependencies. For example

x is dependant on ['a','c','d','g','f','i','h','k','l','q','p','s','u','w','v','y', 'x','z']

These dependencies are also two way. k is dependent on everything including x

but x is not dependant on b or t. These can be placed in separate groups.

I need to divide the list into as MANY non dependant groups as possible.

Each group will be a set of all letters that the group depends on. Non dependent letters will be a set of one.

An example output to the one above is

[
    ['t'],
    ['b'],
    ['a','c','d','g','f','i','h','k','l','q','p','s','u','w','v','y', 'x','z']
]

I am trying to write a function to do this but can't figure out the right way to group everything correctly.

Solution

This is a classic connected components problem. There are a number of efficient linear-time or nearly-linear-time algorithms to solve it, such as with a graph search algorithm like depth-first search or with a union-find data structure.

For a search-based algorithm, you would set up a graph based on your input, with nodes in the input sublists connected, then run a search of the graph to find which nodes are reachable from each other. Graph libraries like NetworkX can handle most of the implementation for you, or you could write it yourself. For example,

import collections

def build_graph(data):
    graph = collections.defaultdict(list)
    for sublist in data:
        # It's enough to connect each sublist element to the first element.
        # No need to connect each sublist element to each other sublist element.
        for item in sublist[1:]:
            graph[sublist[0]].append(item)
            graph[item].append(sublist[0])
        if len(sublist) == 1:
            # Make sure we add nodes even if they don't have edges.
            graph.setdefault(sublist[0], [])
    return graph

def dfs_connected_component(graph, node):
    frontier = [node]
    visited = set()
    while frontier:
        frontier_node = frontier.pop()
        if frontier_node in visited:
            continue
        visited.add(frontier_node)
        frontier.extend(graph[frontier_node])
    return visited

def dependent_groups(data):
    graph = build_graph(data)
    components = []
    nodes_with_component_known = set()
    for node in graph:
        if node in nodes_with_component_known:
            continue
        component = dfs_connected_component(graph, node)
        components.append(component)
        nodes_with_component_known.update(component)
    return components

This would have runtime linear in the size of the input.

You could also use a union-find data structure. A union-find data structure associates items with sets, each set represented by a representative element. They support two operations: find, which finds the representative for an element, and union, which takes two elements and joins their sets into one set.

You can set up a union-find data structure, and for each sublist in your input, union each element of the sublist with the first element of the sublist. This will efficiently group all dependent elements without joining any independent elements together.

With the standard implementation of a union-find data structure as a disjoint-set forest with union by rank and path compression, the runtime would be essentially linear in the size of the input. It'd be slower by a factor of the inverse Ackermann function of the input, which is essentially constant for all practical input sizes.