Search code examples
algorithmgraphtime-complexityconnected-components

How to find connected components in graph efficiently without adjacency matrix?


I want to find the connected components in an undirected graph. However, I don't have an adjacency matrix. Instead I have a set of vertices as well as a function telling me whether two vertices are adjacent. What is the most efficient way to find all connected components?

I know I could just calculate the entire adjacency matrix and use depth-first search to find all components. But that would not be very efficient as I'd need to check every pair of vertices.

What I'm currently doing is the following procedure:

  1. Pick any unassigned vertex which is now its own component
  2. Find all neighbors of that vertex and add them to the component
  3. Find all neighbors of the just added vertices (amongst those vertices not yet assigned to any component) and add them too
  4. Repeat previous step until no new neighbors can be found
  5. The component is now complete, repeat from the first step to find other components until all vertices are assigned

This is the pseudocode:

connected_components(vertices):
  // vertices belonging to the current component and whose neighbors have not yet been added
  vertices_to_check= [vertices.pop()]
  // vertices belonging to the current component
  current_component = []
  components = []
  while (vertices.notEmpty() or vertices_to_check.notEmpty()):
    if (vertices_to_check.isEmpty()): // All vertices in the current component have been found
      components.add(current_component)
      current_component = []
      vertices_to_check= [vertices.pop()]
    next_vertex = vertices_to_check.pop()
    current_component.add(next_vertex)
    for vertex in vertices: // Find all neighbors of next_vertex
      if (vertices_adjacent(vertex, next_vertex)):
        vertices.remove(vertex)
        vertices_to_check.add(vertex)
  components.add(current_component)
  return components

I understand that this method is faster than calculating the adjacency matrix in most cases, as I don't need to check whether two vertices are adjacent, if it is already known they belong to the same component. But is there a way to improve this algorithm?


Solution

  • Ultimately, any algorithm will have to call vertices_adjacent for every single pair of vertices that turn out to belong to separate components, because otherwise it will never be able to verify that there's no link between those components.

    Now, if a majority of vertices all belong to a single component, then there may not be too many such pairs; but unless you expect a majority of vertices all belong to a single component, there's little point optimizing specifically for that case. So, discarding that case, the very best-case scenario is:

    • There turn out to be exactly two components, each with the same number of vertices (½|V| each). So there are ¼|V|2 pairs of vertices that belong to separate components, and you need to call vertices_adjacent for each of those pairs.
    • These two components turn out to be complete, or you turn out to be exceptionally lucky in your choice of edges to check for first, such that you can detect the connected parts by checking just |V| − 2 pairs.

    . . . which still involves making ¼|V|2 + |V| − 2 calls to vertices_adjacent. By comparison, the build-an-adjacency-list approach makes ½|V|2 − ½|V| calls — which is more than the best-case scenario, but by a factor of less than 2. (And the worst-case scenario is simply equivalent to the build-an-adjacency-list approach. That would happen if no component contains more than two vertices, or if the graph is acyclic and you get unlikely in your choice of edges to check for first. Most graphs will be somewhere in between.)

    So it's probably not worth trying to optimize too closely for the exact minimum number of calls to vertices_adjacent.

    That said, your approach seems pretty reasonable to me; it doesn't make any calls to vertices_adjacent that are clearly unnecessary, so the only improvement would be a probabilistic one, if it could do a better job guessing which calls will turn out to be useful for eliminating later calls.

    One possibility: in many graphs, there are some vertices that have a lot of neighbors and some vertices that have relatively few, according to a power-law distribution. So if you prioritize vertices based on how many neighbors they're already known to have, you may be able to take advantage of that pattern. (I think this is especially likely to be useful if the majority of vertices really do all belong to a single component, which is the only case where a better-than-factor-of-2 improvement is even conceivable.) But, you'll have to test to see if it actually makes a difference for the graphs you're interested in.