gremlin graph-databases tinkerpop tinkerpop3 amazon-neptune

Gremlin query for getting the order for a "dependency graph" like graph

With the following graph:

g.addV("entity").property(id, "A")
 .addV("entity").property(id, "B")
 .addV("entity").property(id, "C")
 .addV("entity").property(id, "D")
 .addV("entity").property(id, "E")
 .addV("entity").property(id, "F")
 .addV("entity").property(id, "G")
 .addV("entity").property(id, "H")
 .addE("needs").from(g.V("A")).to(g.V("B"))
 .addE("needs").from(g.V("A")).to(g.V("C"))
 .addE("needs").from(g.V("A")).to(g.V("D"))
 .addE("needs").from(g.V("A")).to(g.V("E"))
 .addE("needs").from(g.V("C")).to(g.V("F"))
 .addE("needs").from(g.V("D")).to(g.V("F"))
 .addE("needs").from(g.V("E")).to(g.V("G"))
 .addE("needs").from(g.V("F")).to(g.V("H"))

One variant of a correct order would be

[[B, H, G], [F, E], [C, D], [A]]

or just

[B, H, G, F, E, C, D, A]

The order within a "group" (e.g. [B, H, G] vs [H, G, B]) is not important.

How do I retrieve the correct order to resolve these nodes if they were to live in a dependency graph?

My thinking of the algorithm would be along the lines of one of these two solutions:

A) Start at leaf nodes and traverse to parents

Retrieve all leaf vertices (no outgoing edges), append to result set, and put into current and resolved.
Loop through all current and see if all of their children (the vertices they reference to) have are contained within resolved. If they are, save the current vertex into resolved and append to result set. Save the vertex' parents into current.
If current has elements, go to 2, else finish.

B) Simplified start at leaf nodes but look through whole graph every iteration

Retrieve all leaf vertices (no outgoing edges), append to result set, and put into resolved.
Find all vertices that that are not contained within resolved and that do not have any referenced vertices within resolved. Append those to the result set. If there are no vertices, return result set, else repeat 2.

Is this possible? If so, how? Can I use these current and resolved collections while traversing?

Solution

In a large graph this is not likely to be an efficient query but here is one solution. Note that the list after the initial set is in the reverse order from your question as that is the order in which those vertices are encountered. A more efficient approach would be perhaps to reverse the "tree" and to start from known roots rather than "all leafs".

gremlin> g.V().where(__.not(out())).store('a').
......1>       repeat(__.in()).emit().
......2>       until(__.not(__.in())).
......3>       dedup().
......4>       union(cap('a'),identity()).
......5>       fold()    

==>[[v[A1],v[A4]],v[A0],v[A3],v[A2]]

UPDATED

Given your new data set I had to take a different approach. My prior query was using a breadth first search but that does not yield the ordering that you need. I changed the approach to use the index step and for each vertex found only keep the one found with the highest index value in a path. The resultant list is the one you need, just in reverse order. We can also add to the query to reverse the list - shown further below.

gremlin>     g.V().where(__.not(out())).
......1>     repeat(__.in()).
......2>     until(__.not(__.in())).path().index().unfold().
......3>     order().by(tail(local),desc).limit(local,1).dedup().fold()

==>[v[A],v[C],v[D],v[E],v[F],v[B],v[G],v[H]]

If you want to reverse the list as part of the query you can also do that.

gremlin> g.V().where(__.not(out())).
......1> repeat(__.in()).
......2> until(__.not(__.in())).path().index().unfold().
......3> order().by(tail(local),desc).limit(local,1).dedup().fold().index().
......4> unfold().order().by(tail(local),desc).limit(local,1).fold()

==>[v[H],v[G],v[B],v[F],v[E],v[D],v[C],v[A]]