I have a simple graph that has Nodes which represent duplicate record id in the below form
Duplicate Id, Original Id
A,B
B,C
C,D
X,Y
Y,Z
The directed graph looks like A -> B ->C ->D and I want CSV result that looks like below that has each Node with ultimate leaf node with no more outgoing edges
A,D
B,D
C,D
X,Z
Y,Z
The above is a simplistic scenario to explain the problem however actual data will have more complex scenarios like below where I have 24 Nodes from A to X with each node connect to other 23 having 24C2=276 directed edges and for each of the 24 nodes I require the ultimate node that has no more outgoing edges.
A,B
A,C
A,D
A,E
A,F
A,G
A,H
A,I
A,J
A,K
A,L
A,M
A,N
A,O
A,P
A,Q
A,R
A,S
A,T
A,U
A,V
A,W
A,X
B,C
B,D
B,E
B,F
B,G
B,H
B,I
B,J
B,K
B,L
B,M
B,N
B,O
B,P
B,Q
B,R
B,S
B,T
B,U
B,V
B,W
B,X
C,D
C,E
C,F
C,G
C,H
C,I
C,J
C,K
C,L
C,M
C,N
C,O
C,P
C,Q
C,R
C,S
C,T
C,U
C,V
C,W
C,X
D,E
D,F
D,G
D,H
D,I
D,J
D,K
D,L
D,M
D,N
D,O
D,P
D,Q
D,R
D,S
D,T
D,U
D,V
D,W
D,X
E,F
E,G
E,H
E,I
E,J
E,K
E,L
E,M
E,N
E,O
E,P
E,Q
E,R
E,S
E,T
E,U
E,V
E,W
E,X
F,G
F,H
F,I
F,J
F,K
F,L
F,M
F,N
F,O
F,P
F,Q
F,R
F,S
F,T
F,U
F,V
F,W
F,X
G,H
G,I
G,J
G,K
G,L
G,M
G,N
G,O
G,P
G,Q
G,R
G,S
G,T
G,U
G,V
G,W
G,X
H,I
H,J
H,K
H,L
H,M
H,N
H,O
H,P
H,Q
H,R
H,S
H,T
H,U
H,V
H,W
H,X
I,J
I,K
I,L
I,M
I,N
I,O
I,P
I,Q
I,R
I,S
I,T
I,U
I,V
I,W
I,X
J,K
J,L
J,M
J,N
J,O
J,P
J,Q
J,R
J,S
J,T
J,U
J,V
J,W
J,X
K,L
K,M
K,N
K,O
K,P
K,Q
K,R
K,S
K,T
K,U
K,V
K,W
K,X
L,M
L,N
L,O
L,P
L,Q
L,R
L,S
L,T
L,U
L,V
L,W
L,X
M,N
M,O
M,P
M,Q
M,R
M,S
M,T
M,U
M,V
M,W
M,X
N,O
N,P
N,Q
N,R
N,S
N,T
N,U
N,V
N,W
N,X
O,P
O,Q
O,R
O,S
O,T
O,U
O,V
O,W
O,X
P,Q
P,R
P,S
P,T
P,U
P,V
P,W
P,X
Q,R
Q,S
Q,T
Q,U
Q,V
Q,W
Q,X
R,S
R,T
R,U
R,V
R,W
R,X
S,T
S,U
S,V
S,W
S,X
T,U
T,V
T,W
T,X
U,V
U,W
U,X
V,W
V,X
W,X
Here is a graphical visualization using Neo4j -
For the above case each Node A to W will have X as the ultimate Leaf Node.
There will also be Cyclic loops which I need to avoid in the overall solution. It may be too much to have everything in one step however will appreciate guidance.
Update : 2020-10-15 Traversal Optimization need to optimize execution in finding Path from starting Node to Leaf Node
For the Data Scenario below, The result for Vertex A and B should be
A,G
A,H
B,G
B,H
The shortest path from A to G is A->C->E->G ; is it possible to suppress any further traversals from A to G when any 1 shortest path to a leaf node is found ? else it will lead to unwanted execution especially for larger clusters of connected Nodes. This steps needs to be repeated again from A to H whose path will be A->C->F->H and then prevent any further attempts to find paths between A and H.
Thanks
Based on what you have provided, the following graph can be built.
g.addV('A').as('a').
addV('B').as('b').
addV('C').as('c').
addV('D').as('d').
addV('X').as('x').
addV('Y').as('y').
addV('Z').as('z').
addE('dup').from('a').to('b').
addE('dup').from('b').to('c').
addE('dup').from('c').to('d').
addE('dup').from('x').to('y').
addE('dup').from('y').to('z').
iterate()
and the following query used to find the results
gremlin> g.V().
repeat(out().simplePath()).
until(__.not(out())).
path().
by(label).
local(
unfold().
union(
limit(1),
tail()).
fold())
==>[A,D]
==>[B,D]
==>[C,D]
==>[X,Z]
==>[Y,Z]
UPDATED Oct 11th 2020
Using the larger data set you provided, I tweaked the query a bit so that for each starting vertex only one path to a leaf is found. This runs very quickly. Without doing this, starting at say 'A', there are literally millions of unique paths that end up at 'X' which is why the previous query becomes so complex.
gremlin> g.V().
......1> local(
......2> repeat(out().simplePath()).
......3> until(__.not(out())).
......4> path().
......5> by(label).
......6> limit(1)).
......7> local(
......8> unfold().
......9> union(
.....10> limit(1),
.....11> tail()).
.....12> fold())
==>[A,X]
==>[B,X]
==>[C,X]
==>[D,X]
==>[E,X]
==>[F,X]
==>[G,X]
==>[H,X]
==>[I,X]
==>[J,X]
==>[K,X]
==>[L,X]
==>[M,X]
==>[N,X]
==>[O,X]
==>[P,X]
==>[Q,X]
==>[R,X]
==>[S,X]
==>[T,X]
==>[U,X]
==>[V,X]
==>[W,X]
Purely for interest the following query shows the highly connected fan out of the graph.
gremlin> g.V().group().by(label).by(local(out().label().order().fold())).unfold()
==>A=[B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>B=[C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>C=[D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>D=[E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>E=[F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>F=[G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>G=[H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>H=[I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>I=[J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>J=[K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>K=[L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>L=[M, N, O, P, Q, R, S, T, U, V, W, X]
==>M=[N, O, P, Q, R, S, T, U, V, W, X]
==>N=[O, P, Q, R, S, T, U, V, W, X]
==>O=[P, Q, R, S, T, U, V, W, X]
==>P=[Q, R, S, T, U, V, W, X]
==>Q=[R, S, T, U, V, W, X]
==>R=[S, T, U, V, W, X]
==>S=[T, U, V, W, X]
==>T=[U, V, W, X]
==>U=[V, W, X]
==>V=[W, X]
==>W=[X]
==>X=[]
and the counts (multiplying out those numbers gives a large number) which explains why finding all paths from 'A' is an expensive query. Note that the simplePath
step helps us by making sure we do not follow any cycles. The number of paths from any vertex to 'X' in this data set ends up being 2^(C-1) where C is the number in the list below for a given start.
gremlin> g.V().group().by(label).by(local(out().count())).unfold()
==>A=23
==>B=22
==>C=21
==>D=20
==>E=19
==>F=18
==>G=17
==>H=16
==>I=15
==>J=14
==>K=13
==>L=12
==>M=11
==>N=10
==>O=9
==>P=8
==>Q=7
==>R=6
==>S=5
==>T=4
==>U=3
==>V=2
==>W=1
==>X=0