Search code examples
algorithmgraph-theorydistributed-computingbreadth-first-searchdistributed-system

Distributed Graph Search


Given a huge graph that is partitioned across several nodes. Each node contains some partition of the set of vertices and global adjacency information.

What is the best way to implement to a BFS over this distributed graph, given a source vertex and the address of the node it resides on. The solution should be reliable, and fast.


Solution

  • There are a lot of ways to make this work. Here is a simple approach, that leverages https://en.wikipedia.org/wiki/MapReduce.

    I will assume that there are three pools of machines available to you.

    1. Your graph database
    2. A distributed key/value store (eg Cassandra)
    3. map/reduce workers (eg Hadoop)

    Then the algorithm proceeds as follows:

    Insert into your kv store the fact that you reached the starting vertex at `(0, Null)` where the pair is `(generation, from)`:
    while not finished and last generation found new stuff:
         Do a mapreduce from your k/v store:
           map gets (vertex, (found_generation, from_vertex):
             - sends:
                if found_generation is current_generation:
                    foreach new_vertex adjacent to vertex (lookup in graph db):
                        send: new_vertex: (current_generation + 1, vertex)
                else:
                    send: vertex: (found_generation, from_vertex)
           reduce gets (vertex, (found1, v1), (found2, v2), ...)
               search list for sample record with minimum found
                   store minimum found back in k/v store
    if found target:
       recursively walk k/v store to reconstruct the path
    clear k/v store of working set
    return answer
    

    The key is that all lookups to/from the graph and the k/v store are distributed, and all work within the map/reduce is also distributed. The synchronization per generation is minimal. Therefore this will do most of the work in a distributed way.

    And a performance note. In general going from a simple implementation on a single machine to a distributed machine is an order of magnitude increase in cost and resources, followed by tremendous scalability. Going from a simple implementation to a well-optimized one tends to be 1-2 orders of magnitude improvements in performance, but you're limited to what one machine can do. Pick carefully which approach to take. Only be distributed if you really need to be.