Search code examples
mysqlgraphneo4jspark-graphx

Find all possible paths between two nodes on graph using a graph database


I have a collection of nodes that make up a DAG (directed acyclic graph) with no loops guaranteed. I want to store the nodes in a database and have the database execute a search that shows me all paths between two nodes.

For example, you could think that I have the git history of a complex project.

Each node can be described with a JSON object that has:

  {'id':'id',
   'outbound':['id1','id2','id3']}
  }

So if I had these nodes in the database:

  {'id':'id0',
   'outbound':['id1','id2']}
  }

  {'id':'id1',
   'outbound':['id2','id3','id4','id5,'id6']}
  }

  {'id':'id2',
   'outbound':['id2','id3'}
  }

And if I wanted to know all of the paths connecting id0 and id3, I would want to get three lists:

   id0 -> id1 -> id3
   id0 -> id2 -> id3
   id0 -> id1 -> id2 -> id3

I have thousands of these nodes today, I will have tens of thousands of them tomorrow. However, there are many DAGs in the database, and the typical DAG only has 5-10 nodes, so this problem is tractable.

I believe that there is no way to do this efficiently MySQL (right now all of the objects are stored in a table in a JSON column), however I believe that it is possible to do it efficiently in a graph database like Neo4j.

I've looked at the Neo4J documentation on Path Finding Algorithms and perhaps I'm confused, but the examples don't really look like working examples. I found a MySQL example which uses stored procedures and it doesn't look like it parallelizes very well. I'm not even sure what Amazon Neptune is doing; I think that it is using Spark GraphX.

I'm sort of lost as to where to start on this.


Solution

  • It's perfectly doable with Neo4j.

    Importing json data

    [
      {"id":"id0",
       "outbound":["id1","id2"]
      },
      {"id":"id1",
       "outbound":["id2","id3","id4","id5","id6"]
      },
      {"id":"id2",
       "outbound":["id2","id3"]
      }
    ]
    
    CALL apoc.load.json("graph.json") 
    YIELD value
    MERGE (n:Node {id: value.id})
    WITH n, value.outbound AS outbound
    UNWIND outbound AS o
    MERGE (n2:Node {id: o}) 
    MERGE (n)-[:Edge]->(n2)
    

    enter image description here

    Apparently the data you provided is not acyclic...

    Getting all paths between two nodes

    As you are not mentioning shortest paths, but all paths, there is no specific algorithm required:

    MATCH p=(:Node {id: "id0"})-[:Edge*]->(:Node {id: "id3"}) RETURN nodes(p)
    
    "[{""id"":id0},{""id"":id1},{""id"":id3}]"
    "[{""id"":id0},{""id"":id2},{""id"":id3}]"
    "[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id3}]"
    "[{""id"":id0},{""id"":id2},{""id"":id2},{""id"":id3}]"
    "[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id2},{""id"":id3}]"
    

    Comparaison with MySql

    See how-much-faster-is-a-graph-database-really